Estimate parameters from epidemic curve: how to proceed?


I am trying to build an epidemic model based upon the SIRD model, but I need to estimate the parameters for this model (α = infection rate, β = recovery rate and γ = mortality rate).

What is the procedure to estimate these parameters directly from epidemic data? For instance, I have tried to get γ from the cumulative mortality curve for COVID-19 in China, and I have found that a Holling type III function (y = ax²/(b² + x²)) resembles the data.

As starting parameters, I used the maximum of the cumulative curve for a and a subjective number for b.

How can I find the best fitting parameters for this function based on the data? Is there a procedure in R?

Also: if this approach is correct, would the parameter b correspond to γ? Can I do the same for α and β?

And: can I apply this sort of regression on the raw data instead of the cumulative curves?

Thank you
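One way to approach the fitting question above is ordinary least squares. The sketch below is in Python rather than R and uses only the standard library; in R, the built-in `nls` function performs this kind of nonlinear least-squares fit. It relies on the fact that for a fixed b the Holling type III curve is linear in a, so a has a closed-form solution and only b needs a one-dimensional search. The function names, the search bounds and the data are all illustrative assumptions: the data are synthetic, not real COVID-19 figures. Note that b here is the time at which the curve reaches a/2, so it is not obviously the same quantity as the SIRD mortality rate γ.

```python
def holling3(x, a, b):
    """Holling type III curve: a * x^2 / (b^2 + x^2)."""
    return a * x * x / (b * b + x * x)


def best_a_for_b(xs, ys, b):
    """For a fixed b the model is linear in a, so least squares has a closed form."""
    shape = [holling3(x, 1.0, b) for x in xs]
    return sum(y * f for y, f in zip(ys, shape)) / sum(f * f for f in shape)


def rss(xs, ys, a, b):
    """Residual sum of squares between the data and the curve."""
    return sum((y - holling3(x, a, b)) ** 2 for x, y in zip(xs, ys))


def fit_holling3(xs, ys, b_low, b_high, iterations=200):
    """Golden-section search over b, solving for a exactly at each step.

    Assumes the residual sum of squares has a single minimum in [b_low, b_high].
    """
    ratio = (5 ** 0.5 - 1) / 2
    cost = lambda b: rss(xs, ys, best_a_for_b(xs, ys, b), b)
    for _ in range(iterations):
        left = b_high - ratio * (b_high - b_low)
        right = b_low + ratio * (b_high - b_low)
        if cost(left) < cost(right):
            b_high = right
        else:
            b_low = left
    b = (b_low + b_high) / 2
    return best_a_for_b(xs, ys, b), b


# Synthetic, noise-free example (NOT real data): generated with a = 3000, b = 25.
days = list(range(1, 61))
deaths = [holling3(d, 3000.0, 25.0) for d in days]
a_hat, b_hat = fit_holling3(days, deaths, b_low=1.0, b_high=100.0)
print(a_hat, b_hat)
```

The same routine can be pointed at daily (non-cumulative) counts only if the model function is changed to match, since the Holling curve describes a saturating total rather than a daily rate.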

Explaining models of epidemic spreading

In an event like an epidemic, policymakers are keen to know how the disease will spread. For instance, they might be keen to know how many people are likely to be infected in future. This will help them make decisions and allocate resources towards disease control, like the number of ICUs or ventilators required in a region. These and several other predictions concerning the spread of disease are accomplished using mathematical models of epidemiology.

Mathematical models help us make our mental models more quantitative. Models are not reality; however, studies across the natural and physical sciences have shown the importance of models in understanding nature. Say we need to send a spacecraft to the Moon. To find out how much velocity a spacecraft needs to escape the Earth’s gravity, we would not design hundreds of spacecraft and launch them at different speeds to see which one reaches the Moon, right? Instead, we rely on mathematical equations that predict the velocity and all other features a spacecraft should have to reach the Moon.

All said, models are not always accurate representations and they do come with limitations. To extend the analogy of the spacecraft designed to reach the Moon, if it were to reach Jupiter, our model would need some tweaks to get it there. It is important to understand the assumptions behind a model and its scope before using it to make predictions and policies.

Models are used to predict the future of the system under study. In the case of epidemics too, we need mathematical modelling to understand how the disease is most likely to spread, and where it is more likely to spread. It can be viewed as a shortcut: instead of implementing many guesses about how to deal with the spread of a disease, we can use some nifty equations to see what implementing each of these guesses would mean, and take better-informed decisions. Even as you read this, mathematical modelling has been at the heart of several policy decisions worldwide regarding the response to COVID-19.

Some Aspects of Mathematical Models

So, the question is, how are models developed and used? Typically, models are constructed based on some reasonable hypotheses. For instance, to construct a model for epidemics, one needs to make some assumptions about the mode of disease spread. Measles or COVID-19 spread when infected individuals come in contact with healthy ones. Malaria, on the other hand, needs both mosquitoes and infected humans. A model for malaria would need to take into account the mosquito population as well as that of people, and would differ greatly from a model for measles. To give another example of a hypothesis that goes into model formulation, one could postulate age-dependent infectivity, i.e., that the likelihood of being infected on contact depends on the age of the person, with older people more likely to contract the infection. Similarly, one could postulate something about recovery (“younger people recover faster”) or reinfection (“a person who recovers is immune from reinfection for a period of two years”). All these hypotheses can be explicitly incorporated into models. These hypotheses are based on the biology of the disease, i.e., on what we know about the pathogen and the human body. Several different models for the same disease are possible, each differing in the finer details it incorporates among its hypotheses. Needless to say, models are only as good as the assumptions they are based on.

All models have parameters: numbers which can be tuned to suit the particular context to which the model is being applied. For example, consider control measures such as physical distancing. Under normal circumstances, people tend to be physically near each other, leading to a greater likelihood of being infected on proximal contact. When physical distancing is enforced, this likelihood decreases. The likelihood of infection on contact is thus a parameter: when chosen to be small, it captures the situation where physical distancing is being enforced, and when chosen to be large, it captures the business-as-usual scenario. Similarly, age demographics vary by country, state or region. If age-dependent infectivity is part of our model hypothesis, we would need parameters which keep track of the proportion of the population in each age group. These parameters would have markedly different values for India and the US, for example: the Indian population is predominantly young, while that of the US is more evenly spread out across ages. Recovery rates or immunity periods are also parameters, which can be tuned differently to model different diseases. Thus, choosing parameters appropriately allows one to apply the same underlying model to different diseases, different countries or regions, different control measure scenarios, etc.

Once the broad hypotheses and parameters are chosen, the model is written down in terms of mathematical equations. These are typically differential equations, which can then be solved on a computer to obtain the quantities of interest (e.g., the number of infected people) at different points of time.

At the outset, we cannot be sure that we’ve made good choices of parameters. But we correct this gradually by comparing model predictions with actual data (as it becomes available) and tuning our parameters so that there is a close match between the two. For instance, if we want to use a new model to predict the number of COVID-19 infections in Chennai in June 2020, we first validate it using the data on infections until now. In other words, we fit our parameters such that our model is able to explain the daily number of infections until today (April 16, 2020). Once the model is validated, it can be used to predict future behaviour and suggest new experiments to study the population. As days go on and new data becomes available, it is possible to test the model predictions. In some cases, the model is improved or refined as more data becomes available, and the cycle continues.

SIR and SEIR Models of Infectious Diseases

SIR models are commonly used to study the number of people having an infectious disease in a population. The model categorizes each individual in the population into one of the following three groups:

Susceptible (S) – people who have not yet been infected and could potentially catch the infection.
Infectious (I) – people who are currently infected (active cases) and could potentially infect others they come in contact with.
Recovered (R) – people who have recovered (or have died) from the disease and are thereby immune to further infections.

Cartoon showing individuals in a population categorized as S, I, R.

These compartments contain a certain number of people on each day. However, that number changes from day to day, as individuals move from one compartment to another. For instance, individuals in compartment S will move to compartment I if they are infected. Similarly, infected people (I) will move to the recovered (R) compartment once they recover or die from the disease.

The total population across the three compartments (S+I+R) is assumed to remain the same at all times. This is just the total population of the country (or state/region) we are considering. This means that everyone exists in one of these 3 compartments. This ignores the fact that in the natural course of things (epidemic or not), births and deaths continue to happen in the country. But for short epidemics that last a few months, this is a reasonable assumption to make! For modelling other diseases like childhood infectious diseases, such as measles, that recur regularly, natural birth and death rates of the population will also have to be taken into account.

As in the current epidemic, one can find the number of active cases (I) and the number of recovered or dead (R) from authorized sources such as ministries. Also reported is the total number of infected people to date, which, if one thinks about it, is nothing but the sum I+R.

Our goal is to find out how the number of people in each compartment changes with time. In order to do that, we make two simple hypotheses on what drives the movement of people between these compartments.

Population divided into compartments, S, I and R whose numbers change with time. The total population (the sum of the populations in S, I, R) remains the same at all times.

The first hypothesis: Let us suppose you have not been infected at this point in time. So, you would belong to the S compartment. You can be exposed to the virus only when you come in contact with an infected person. The greater the number of infected people in the general population, the higher the chance that you will come in contact with an infected individual. This same principle which applies to you, applies equally to every other susceptible individual in the population. Therefore, the rate at which susceptible people become infected, i.e., the rate at which people are transferred from the S to the I compartments on a given day is proportional to the size of the I compartment as well as to the size of the S compartment on that day.

The second hypothesis: Infected people will either recover or die of the disease. On each day, a certain fraction of infected individuals will recover or die. This fraction is taken to be a constant, independent of the number of susceptible, infected, or recovered individuals on that given day. This fraction is somehow “intrinsic” to the specific pathogen and captures the average human body’s recovery time for that particular disease.

What mathematical modellers do is write the above hypotheses in terms of mathematical equations which tell you how the numbers of susceptible, infected and recovered individuals change with time. In the language of mathematics, such equations are referred to as differential equations. These equations are solved by a process called integration, and the solutions allow us to calculate, for example, the number of infected people at any time in the future.
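To make the two hypotheses concrete, here is a minimal Python sketch that turns them into equations and integrates them with small forward-Euler time steps. The parameter names `infection_rate` and `recovery_rate`, the division by the total population (a common normalisation of the first hypothesis), and all numerical values are assumptions chosen for illustration; they are not fitted to any real outbreak.

```python
def simulate_sir(population, initial_infected, infection_rate, recovery_rate,
                 days, steps_per_day=10):
    """Integrate the SIR equations with small forward-Euler steps.

    dS/dt = -infection_rate * S * I / N      (first hypothesis)
    dI/dt =  infection_rate * S * I / N - recovery_rate * I
    dR/dt =  recovery_rate * I               (second hypothesis)
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = infection_rate * s * i / population * dt
            new_removals = recovery_rate * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))  # one record per day
    return history


# Illustrative values only.
history = simulate_sir(population=1_000_000, initial_infected=10,
                       infection_rate=0.5, recovery_rate=0.2, days=160)
peak_day = max(range(len(history)), key=lambda day: history[day][1])
print("active infections peak on day", peak_day)
```

Because every person leaving one compartment enters another, S + I + R stays equal to the total population throughout the run, which mirrors the constant-population assumption described earlier.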

For diseases such as COVID-19, we need to consider another compartment called ‘Exposed’ (E). This consists of individuals who might have the virus (due to travel, or direct or indirect contact with a person who has already tested positive), but do not show any symptoms. For example, if your cousin travelled to Wuhan and came back, she is more at risk than you, because she has been around the virus. In other words, such individuals are between the susceptible and infected compartments. However, despite not showing any symptoms, these (asymptomatic) individuals can still transmit the disease to susceptible individuals. One can add more compartments, for example, ‘Quarantined’ or ‘Isolated’, to better capture ongoing disease control measures. The modelling proceeds in the same way as in the previous case, with assumptions on the rates at which people move between these compartments. The solution allows us to calculate the number of infectious people at any future time.

Disease Transmission and Containment

Models enable the quantification of the spread of diseases. The rate of spread of infections in a certain population is governed by a quantity R0, called the basic reproduction number. The R0 value can be looked at as the intensity of the infectious disease outbreak. The higher the R0 value of a disease, the faster the disease spreads among the population. In simple terms, the value of R0 is equal to the number of new cases that an infected person will cause, on average. The R0 for measles ranges from 12–18, depending on factors like population density and life expectancy. This shows that measles is a highly infectious disease. If one person gets it, then about 18 will follow. Compared to measles, the novel coronavirus is less contagious. As this virus is new, the estimates are not conclusive, but from the evidence we have, R0 ranges from 2.2–2.6. Several biological and social factors come into play in determining R0. The incubation period, host density, and modes of transmission all affect R0.

Depiction of typical early stage contact tracing showing the number of people coming into contact with each infected individual. Some people become infected on contact and get transformed from S to SI. The bottommost infected person is an example of a superspreader. A more realistic example of contact tracing of the case in Korea can be seen here.

The key insight is that if R0 is less than 1, then the epidemic will die out. Thus, our goal is to reduce R0. We can reduce R0 by quarantining contacts of infected people or vaccinating (if a vaccine is available). Another important way to reduce R0, especially for COVID-19, is physical distancing, i.e., maintaining a 1-2 metre distance from other people at all times. Studies show that the novel coronavirus can travel only about a metre in the air, as compared to the 100 metre range for an airborne disease like measles. So, physical distancing reduces the probability of picking up an infection. This R0 value, however, is only an average estimate and can be affected by unexpected events such as community gatherings. Here, a single infected person can infect several people and is referred to as a super-spreader. For example, a single infected woman in South Korea attended a church service and ended up infecting 5176 others (as of March 18). This is why public gatherings are forbidden: we do not want to even accidentally trigger the hidden super-spreaders.

Flattening the Curve

In the early stages, the disease spreads rapidly through the population. This is called the exponential growth phase. Here, the total number of infected people (I+R in our SIR model) doubles once every ‘D’ days, where the number ‘D’ for the COVID-19 spread in India is around 4 (as of April 6). If this doubling continues unabated, the total number of infected people will grow very fast. Even though only a small fraction of those infected will need hospitalisation, our hospitals will reach capacity in a few weeks and will not be able to serve all those in need of care.
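The arithmetic behind a fixed doubling time can be sketched in a few lines of Python. The starting count of 1000 is a made-up number for illustration; only the doubling time of 4 days comes from the text, and the calculation assumes that doubling continues unchanged.

```python
def cases_after(initial_cases, doubling_time_days, elapsed_days):
    """Total cases after elapsed_days, assuming unchecked exponential growth."""
    return initial_cases * 2 ** (elapsed_days / doubling_time_days)


# Hypothetical starting point of 1000 total cases, doubling every 4 days.
projection = [(day, round(cases_after(1000, 4, day))) for day in (0, 4, 8, 28)]
for day, cases in projection:
    print(day, cases)
```

After 28 days, which is seven doublings, the total has grown by a factor of 2^7 = 128.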

The higher the basic reproduction number R0, the quicker the doubling and the sooner our hospitals will be overwhelmed. If R0 is reduced by control measures (such as physical distancing), then the doubling rate slows down. While very many people will still be infected, this will occur over a longer period of time (months rather than weeks) and the daily demand for hospital beds will not outstrip supply. This stretching out of the infection curve is referred to as flattening the curve.

As days pass by, the number of active infections reaches a peak, after which the spread of infections slows down as fewer and fewer susceptible individuals remain in the population. Mathematical modelling can estimate the number of people who will be infected by the disease at any given time and how this will vary according to control measures imposed. It also gives us an estimate for when the epidemic will peak. In turn, this gives an idea of the number of hospital beds/ICUs required for the population.

Now suppose we have a vaccine against the disease. This will reduce R0 since the number of susceptibles will decrease. As more people are vaccinated, the disease will come under control. Two Scottish mathematicians, Kermack and McKendrick, who first proposed the SIR model in 1927, showed that we do not have to vaccinate the entire population for an epidemic to die out. Vaccinating only a fraction of the population is enough, and this fraction depends on R0. This fraction for the novel coronavirus causing COVID-19 has been found to be roughly 60%. This result is another example of how mathematical modelling is extremely useful.
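The text does not state the formula behind this fraction. In the simple SIR model the standard result is that the critical fraction equals 1 - 1/R0, and the sketch below assumes that this is the relationship being referred to. The R0 values are the ranges quoted earlier for the novel coronavirus and for measles.

```python
def herd_immunity_threshold(r0):
    """Fraction of the population that must be immune for spread to decline.

    Uses the simple-SIR result 1 - 1/R0 (an assumption about which
    formula the text has in mind); returns 0 when R0 <= 1.
    """
    if r0 <= 1:
        return 0.0
    return 1.0 - 1.0 / r0


thresholds = {r0: herd_immunity_threshold(r0) for r0 in (2.2, 2.6, 12, 18)}
for r0, fraction in thresholds.items():
    print(f"R0 = {r0}: vaccinate about {fraction:.0%}")
```

With R0 between 2.2 and 2.6 the threshold falls between roughly 55% and 62%, consistent with the figure of about 60% quoted above.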

Extensions and Limitations of Models

The compartment models (SIR/SEIR) can be further enhanced using more compartments, such as presymptomatic, asymptomatic, infected-quarantined, infected-recovered, dead, etc. One example is shown below.

A more sophisticated model with more compartments, such as Is=Infected severely symptomatic, Ip=Infected presymptomatic, Ia=Infected asymptomatic, Im=Infected mildly symptomatic, H=hospitalized, D=dead.

Additionally, people can also be categorized by age, since their response to treatment in the case of COVID-19 is different. Thus, we can construct much richer models starting from the simple SIR model.

However, to find out the rates at which people move from one compartment to another, we need certain parameters. A model with bad parameters will not yield good results. The more complicated the model, the more parameters there are. These parameters vary between countries and between regions within a country. Further, they are affected by various factors such as migration and non-pharmaceutical interventions such as lockdowns, quarantine and testing. Obtaining model parameters is a challenge, and the general strategy is to look at past behaviour and infer the parameters. This procedure, referred to as fitting, is widely used in mathematical modelling. However, these parameters change with time, so what held at an earlier time may not be valid at a later time.

Since these parameters are reflective of the behavior of humans, they are also affected by perception and information flow. For example, if there is a rumor in a community about a nearby transmission, they will be more conscious about physical interactions. Similarly, messages on various media platforms can change the behavior of people.

Infected people act differently. Some have lots of contacts whereas some have few contacts.

In addition to model parameters, the compartment models have a more fundamental limitation as illustrated here. In a SIR/SEIR model, many people fall into the susceptible compartment but not every susceptible individual has the same chance of encountering an infected individual. Healthcare workers, for instance, have more chances of getting infected. Individuals belonging to the same network (social, religious) have varying chances of getting infected depending on their network(s). For example, a shopkeeper meets hundreds of customers a day and therefore his chance of being exposed is much higher. Thus, models that account for individual behavior as opposed to behavior of a collection will capture this variation across individuals in the same compartment. This is where agent-based models (ABMs), also known as individual-based models or IBMs, become useful.

Individual or Agent-based models

Agent-based models (ABMs), also known as individual-based models (IBMs), simulate the behaviour of autonomous agents. While modelling a disease, the agents are usually individual people. This contrasts with the previous model, which only kept track of how the total number of susceptible, exposed, infectious and recovered individuals varies with time. Thus, the previous model assumes that each individual acts in a similar way and has a similar chance of getting infected or transmitting infections. In contrast, ABMs treat each individual separately, and their behaviour can differ.

Each agent (individual) has a certain set of properties. The most relevant property is their infection state (S, I or R), but there are several other relevant properties. For example, age, comorbidity, social contacts, etc. vary across individuals and are relevant factors in the disease spread. Further, there can also be information about spatial location. At each step (time) of the model, the individuals can change their properties depending on their neighbours. For example, S6 in the figure does not have any neighbouring infected agents and will behave differently from S1, which is close to an infected agent. In more advanced models, stochasticity (randomness) and the property of learning from past actions are incorporated to reflect more realistic behaviour.

ABM simulations, in which individual agents decide what to do in each step, overcome certain drawbacks of the SIR model and its derivatives, such as the assumption that the population is homogeneous. It is possible, for instance, to have different types of agents that represent members of different age groups or of different professions, and to incorporate facts such as the greater exposure of healthcare workers to infected individuals, which in turn increases their risk of infection. They serve as “bottom-up” models, in which the emergent outcome is determined by the behaviour of the individuals in the population, and are more realistic methods of modelling populations. ABMs have previously been used to model diseases at multiple spatial scales, from within a city to across an entire nation. ABMs have successfully been used to model various epidemics including H1N1, various strains of influenza, and Ebola.
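A toy agent-based simulation can illustrate the idea. The Python sketch below is a deliberately small example, not a production ABM: it has no spatial layout, contacts are drawn at random, and every parameter value (population size, contact counts, probabilities) is a hypothetical choice. Its only heterogeneity is that a small share of agents, standing in for shopkeepers or healthcare workers, meet several times more people per day than the rest.

```python
import random


def run_abm(num_agents=500, contacts_per_day=8, high_contact_share=0.05,
            high_contact_multiplier=5, transmission_prob=0.03,
            recovery_prob=0.1, days=120, seed=1):
    """Simulate S -> I -> R transitions agent by agent; return daily (S, I, R) counts."""
    rng = random.Random(seed)
    state = ["S"] * num_agents
    # A small share of agents meet many more people each day than the others.
    contacts = [contacts_per_day * high_contact_multiplier
                if rng.random() < high_contact_share else contacts_per_day
                for _ in range(num_agents)]
    for patient_zero in rng.sample(range(num_agents), 5):
        state[patient_zero] = "I"
    counts = []
    for _ in range(days):
        next_state = list(state)  # update everyone from the same snapshot
        for agent in range(num_agents):
            if state[agent] == "S":
                # Each contact with an infected agent may transmit the disease.
                for _ in range(contacts[agent]):
                    other = rng.randrange(num_agents)
                    if state[other] == "I" and rng.random() < transmission_prob:
                        next_state[agent] = "I"
                        break
            elif state[agent] == "I" and rng.random() < recovery_prob:
                next_state[agent] = "R"
        state = next_state
        counts.append((state.count("S"), state.count("I"), state.count("R")))
    return counts


counts = run_abm()
print("final (S, I, R):", counts[-1])
```

Because each agent carries its own contact count, the same code could be extended with further per-agent properties such as age or location, which is the flexibility the text attributes to ABMs.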

One major drawback to ABMs is that as the number of agents increases, so does the computational power required to run the simulation. Large-scale agent-based models tend to require high-performance computing environments for their implementation. Nevertheless, they are the state-of-the-art as far as modeling goes and can be used to model the entire population of a medium-sized city of say 5 million people.

Why Logistic Growth?

Logistic Growth is a mathematical function that can be used in several situations. Logistic Growth is characterized by increasing growth in the beginning period, but decreasing growth at a later stage, as you get closer to a maximum. For example, in the Coronavirus case, this maximum limit would be the total number of people in the world, because when everybody is sick, the growth will necessarily diminish.

In other use cases of logistic growth, this number could be the size of an animal population that grows exponentially until the moment where their environment does not provide enough food for all animals and hence the growth becomes slower until a maximum capacity of the environment is reached.

The reason to use Logistic Growth for modeling the Coronavirus outbreak is that epidemiologists have studied those types of outbreaks and it is well known that the first period of an epidemic follows Exponential Growth and that the total period can be modeled with a Logistic Growth.
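The shape described above can be sketched with the usual three-parameter logistic function. The parameter names (`capacity`, `growth_rate`, `midpoint`) and the numbers below are assumptions chosen for illustration. The example checks that the early part of the curve grows almost exactly exponentially, while growth near the maximum nearly stops.

```python
import math


def logistic(t, capacity, growth_rate, midpoint):
    """Logistic curve: capacity / (1 + exp(-growth_rate * (t - midpoint)))."""
    return capacity / (1.0 + math.exp(-growth_rate * (t - midpoint)))


# Illustrative parameters only.
curve = [logistic(day, capacity=10_000, growth_rate=0.2, midpoint=40)
         for day in range(0, 81)]

early_ratio = curve[6] / curve[5]    # day-over-day growth early on
late_ratio = curve[76] / curve[75]   # day-over-day growth near the maximum
print(early_ratio, math.exp(0.2), late_ratio)
```

Early on the daily growth factor is close to exp(growth_rate), which is the exponential phase; at the midpoint the curve has reached exactly half of its maximum.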

Prism 3 -- Calculating "Unknown" Concentrations using a Standard Curve

A standard curve is a graph relating a measured quantity (radioactivity, fluorescence, or optical density, for example) to concentration of the substance of interest in "known" samples. You prepare and assay "known" samples containing the substance in amounts chosen to span the range of concentrations that you expect to find in the "unknown" samples. You then draw the standard curve by plotting assayed quantity (on the Y axis) vs. concentration (on the X axis). Such a curve can be used to determine concentrations of the substance in "unknown" samples. Prism automates this process.

Prism can fit standard curves using nonlinear regression (curve fitting), linear regression, or a cubic spline (or LOWESS) curve. To find "unknown" concentrations using a standard curve, follow these steps:

In the Welcome to Prism dialog box, select Create a new project and Work independently. Choose to format the X column as Numbers and to format the Y column for the number of replicates in your data. For our example, choose A single column of values.

Enter data for the standard curve. In our example, shown below, we entered concentrations for the "known" samples into the X column, rows 1-5, and the corresponding assay results into the Y column. Don't worry if Prism displays trailing zeros that you didn't enter--we'll change that later. Just below the standard curve values, starting in row 6, enter the assay results for the "unknown" samples into the Y column, leaving the corresponding X cells blank. Later, Prism will fit the standard curve and then report the unknown substance concentrations using that curve.

Click on the Analyze button. Choose Built-in analysis. From the Curves & Regression category, select Linear regression if you are using our example data (or if you are analyzing data that you suspect are curvilinear, choose Nonlinear regression [curve fit]). For an example of how to proceed with nonlinear data, see the example on Analyzing RIA or ELISA Data.

In the Parameters: Linear regression dialog box, check the box labeled Standard Curve X from Y, because we want our unknown concentrations to be provided. In the Output options category, select the Auto options for determining where Prism will start and end the regression line. Set the number of significant digits to 3.

When you click OK to leave the linear regression parameters dialog, Prism performs the fit and creates a results sheet.

Prism displays the results on pages called views. The default view shows the defining parameters for the curve of best fit.

Locate the drop-down box labeled "View" in the third row of the tool bar (not the "View" menu at the very top of the screen). Select Interpolated X values (in early Prism releases, this was "Standard curve X from Y").

Prism reports the corresponding X value for each unpaired Y value on your data sheet.
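For readers without Prism, the calculation behind "Standard curve X from Y" with a linear fit can be sketched in Python: fit a straight line to the known samples, then invert it for each unknown Y. This is an illustration of the arithmetic only, not Prism's implementation, and the concentrations and assay readings are made-up values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def interpolate_x(y, slope, intercept):
    """Invert the fitted line to read a concentration (X) off an assay value (Y)."""
    return (y - intercept) / slope


# Made-up standards: five "known" concentrations and their assay readings.
known_conc = [1.0, 2.0, 4.0, 8.0, 16.0]
assay = [0.15, 0.25, 0.45, 0.85, 1.65]
# Assay readings for the "unknown" samples (X left blank in the data sheet).
unknown_assay = [0.35, 1.05]

slope, intercept = fit_line(known_conc, assay)
unknown_conc = [interpolate_x(y, slope, intercept) for y in unknown_assay]
print(unknown_conc)
```

Interpolation is only trustworthy for unknowns whose readings fall within the range spanned by the standards, which is why the standards are chosen to cover the expected concentrations.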

Add Unknowns to your Graph

Prism's automatic graph includes the data from the data sheet and the curve. To add the "unknowns" to the graph:

Switch to the Graphs section of your project.

Click on the Change button and then select Data on Graph.

The dialog box shows all data and results tables that are represented on the graph. Click on the Add button.

From the drop-down list at the top of the Add Data Sets to Graph dialog box, select . Linear regression: Interpolated X values. Click on Add, then Close. The list of data sets included on the graph should now look like the window below.

Press OK to return to the graph.

If you want the "unknowns" represented as spikes projected to the X axis (rather than data points), click on the Change button and select Symbols and Lines. From the "Data set" drop-down list, select . Linear regression: Interpolated X values. Change the symbol shape to one of the last 4 options (spikes), and set the size to 0.

Standard curve results -- Unknowns are represented as spikes on the graph; numerical results are reported as an embedded table. The illustration includes some formatting changes not discussed here.

Estimating the spread rate in the current ebola epidemic

I’ve now written several articles on the West African ebola outbreak (see e.g. here, here, here, and here). This time I want to get more analytical, by describing how I estimated the ebola basic reproduction rate Ro (“R zero”), i.e. the rate of infection spread. Almost certainly, various people are making these estimates, but I’ve not seen any yet, including at the WHO and CDC websites or the few articles that have come out to date.

Some background first. Ro is a fundamental parameter in epidemiology, conceptually similar to r, the “intrinsic rate of increase”, in population biology. In epidemiology, it’s defined as the (mean) number of secondary disease cases arising from some primary case. When an individual gets infected, he or she is a secondary case relative to the primary case that infected him or her, and in turn becomes a primary case capable of spreading the disease to others. It’s a lineage in that respect, and fractal. I’ll refer to it simply as R here.

The value of R depends strongly on the biology of the virus and the behavior of the infected. It is thus more context dependent than the r parameter of population biology, which is an idealized, or optimum, rate of population growth determined by intrinsic reproductive parameters (e.g. age to reproductive maturity, mean litter size, gestation time). Diseases which are highly contagious, such as measles, smallpox and the flu, have R values in the range of 3 to 8 or even much more, whereas those that require direct exchange of body fluids, like HIV, have much lower rates.

To slow an epidemic of any disease it is necessary to lower the value of R; to stop it completely, R must be brought to zero. Any value of R > 0.0 indicates a disease with at least some activity in the target population. When R = 1.0, there is a steady increase in the cumulative number of cases, but no change in the rate of infection (new cases per unit time): each infected person infects (on average) exactly 1.0 other person before either recovering or dying. Any R > 1.0 indicates a (necessarily exponential) increase in the infection rate; that is, the rate of new cases per unit time (not just the total number of cases) is increasing.

Ebola virus can apparently remain infective outside a warm body for a little while and can also spread via aerosols, but the dominant transmission mode is (by far) via direct body and body fluid contact. This fact should tend to favor generally lower R values. However, it’s also a very new disease, known only since 1976, and as far as West Africa is concerned, entirely unfamiliar–without any immune resistance in the human population. The same held true when it first showed up in Zaire and Uganda previously–it’s apparently a new disease everywhere it has shown up. Therefore, one might expect significantly higher spread rates than for a viral disease having the same basic biology but with which the population has some residual immunity.

So, on to making the estimates and considerations therein. There are four main issues here: (1) smoothing the data, (2) choosing an optimal mathematical model, (3) estimating that model’s parameter(s), and (4) estimating R from these values. I’ll take them in turn.

A complication in this situation involves the data gathering and reporting timelines, namely that the reported daily new case and death rates jump wildly from one report to the next. There is no biological reason whatsoever to expect this: it is almost certainly due to how the data are being collected and reported. For tabulating the total number of cases this is not a major issue, but for estimating rates it certainly is. One has to smooth these fluctuations out, and I’ve done this using loess, or locally weighted regression. Loess is subjective however, because you have to choose how “locally” you want the regression to be weighted.
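The idea of locally weighted regression can be sketched in plain Python. This is a simplified stand-in and not the algorithm behind R's `loess` function: it fits a weighted straight line around each point using tricube weights and omits the refinements of the real routine. The `span` argument plays the same role as described, namely the fraction of the data used in each local fit, and the case counts are made-up numbers with artificial reporting spikes.

```python
def local_linear_smooth(xs, ys, span=0.75):
    """Locally weighted linear regression (a simplified loess-style smoother).

    span is the fraction of points used in each local fit; larger values
    give stiffer curves.
    """
    n = len(xs)
    k = max(3, min(n, int(round(span * n))))
    smoothed = []
    for x0 in xs:
        # Use the k points nearest to x0, weighted by the tricube function.
        nearest = sorted(range(n), key=lambda j: abs(xs[j] - x0))[:k]
        max_dist = max(abs(xs[j] - x0) for j in nearest) or 1.0
        weights = [(1 - (abs(xs[j] - x0) / max_dist) ** 3) ** 3 for j in nearest]
        # Weighted least-squares line through those points.
        total = sum(weights)
        mean_x = sum(w * xs[j] for w, j in zip(weights, nearest)) / total
        mean_y = sum(w * ys[j] for w, j in zip(weights, nearest)) / total
        sxx = sum(w * (xs[j] - mean_x) ** 2 for w, j in zip(weights, nearest))
        sxy = sum(w * (xs[j] - mean_x) * (ys[j] - mean_y)
                  for w, j in zip(weights, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(mean_y + slope * (x0 - mean_x))
    return smoothed


# Made-up daily new-case reports with spikes caused by batched reporting.
days = list(range(12))
reported = [5, 0, 12, 3, 0, 25, 8, 0, 40, 10, 2, 60]
smooth = local_linear_smooth(days, reported, span=1.0)
print([round(v, 1) for v in smooth])
```

Rerunning with a smaller `span` lets the curve follow the spikes more closely, which is the subjectivity the text points out.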

As for issue (2), we know two important facts: (1) R > 1.0 (the number of new cases and deaths per unit time is increasing), and (2) R > 1.0 strongly implies an exponential function (a steadily but non-exponentially increasing case rate is possible but unstable, and thus unlikely over an extended period). Exponential functions are of the form y = b^(ax), where b is some chosen base, usually e (≈2.718) or 10, and a is the single estimated parameter.

Issue (3) necessarily involves fitting the model to the data, and thus potentially involves issue (1). Should we fit the model to the raw data, or to the smoothed data (the latter being much more likely to reflect the actual biology, rather than artifacts)? If we fit it to raw data that is greatly affected by spikes in case reports, or other issues, that’s going to induce some error.

Issue (4) involves very simple math in this case, simply converting one base to another easy (but critical).

In a nutshell, this is what I did. First, I waited for the next WHO report (containing a large case spike) to come out, assuming that whenever such a large spike is shown, that this is an accurate reflection of total cases to date. I then computed the total case and death rates from the shown data, and smoothed these using a relatively stiff “loess” smoothing function, exactly as in the many graphs I’ve shown before. In the R computing language which I use, the stiffness of the resulting loess smoothing is controlled by the “span” argument, with values of ~1.0 giving relatively stiff curves that are minimally influenced by the large variations in reported cases.

Eyeballing the loess smooth confirmed that it was concave upward over the last three months, and therefore that an exponential growth model was indeed appropriate. The standard way to proceed would be to then fit an exponential model to the data, but in the R program this is done using the nlm or optim functions, neither of which ever seems to work well (and which I was in no mood for this time, if ever). So I worked around it using a brute force approach, but one which also has some other advantages, such as the ability to evaluate the loess function’s performance.

So then, what I did was to:
(1) take the loess-estimated new case rate from the latest WHO report (Aug. 22),
(2) compute the per-day growth rate that would be required to achieve that value over the last ~three months, and
(3) use that value as a starting point for systematic searches of values giving even better estimates (as evaluated via the residual sums of squares).

A key point in step (2) is that the parameter estimated (in model y = b^ax), is not in fact a, but rather the base b. It is easier (and more informative) to estimate b, the per-day growth rate of infected people in the population, as b = y^(1/x), where y is obtained from step (1), a is set to 1.0, and x is the number of days from whenever the primary cases were identified. The conversion from the population growth rate, b, to the per-person spread rate, R, is simple if we know a little about ebola disease progression, which we do.

With a starting point estimate of b, which from eyeballing I have confidence is reasonable, I then compute the sums of squares for all values of b between 0.5b and 1.5b, in 200 increments (the brute force part). Choosing the best fitting value of b, I then convert that to an estimate of R under the following logic. The value of R is time-independent, the total number of secondary cases arising from some set of primary cases. Therefore, I really just need to know how long ebola patients are infectious and the daily growth rate of ebola infection. From what I have read, patients are infectious for roughly 6 to 12 days, before either dying or recovering (they are assumed to be uninfectious (or nearly so) before this time, but to retain some infectiousness after recovery). In the absence of better data, R is therefore estimated simply by R = b^d, where d ranges from 6-12 days.
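The brute-force search described above can be sketched as follows. This is an illustration under stated assumptions, not the author's R code: the data are synthetic and noise-free, the model is y = b^x with a fixed at 1.0, and the names `sum_of_squares` and `brute_force_growth_factor` are made up for the example. "200 increments" is read here as 201 equally spaced candidates, so the starting value itself is on the grid.

```python
def sum_of_squares(b, days, observed):
    """Residual sum of squares of the model y = b**x against the data."""
    return sum((y - b ** x) ** 2 for x, y in zip(days, observed))

def brute_force_growth_factor(days, observed, n_increments=200):
    # Starting estimate: the per-day factor that reaches the latest value,
    # b = y**(1/x), using the last observation.
    b_start = observed[-1] ** (1.0 / days[-1])
    # Candidates from 0.5*b to 1.5*b in n_increments equal steps.
    candidates = [0.5 * b_start + i * b_start / n_increments
                  for i in range(n_increments + 1)]
    b_best = min(candidates, key=lambda b: sum_of_squares(b, days, observed))
    return b_start, b_best

# Synthetic, noise-free series growing by 4.3% per day (assumed data)
days = list(range(1, 101))
observed = [1.043 ** x for x in days]

b_start, b_best = brute_force_growth_factor(days, observed)
print(f"start = {b_start:.4f}, best on grid = {b_best:.4f}")
```

With real, noisy data the best grid value and the starting value would generally differ, and that difference is what indicates how well the smoothed end point represents the whole series.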

So finally, here are the results, over all cases in all countries, using a time zero of mid-May (that’s when the initial outbreak of March/April appeared to have subsided, before skyrocketing from June until now). I’ve assumed that case identification accuracy is high, i.e. that “probable” and “suspected” cases are largely correct. I’ve also assumed that spread from any animal vectors to humans is not an important component of the epidemic, once started.

The initial estimate of b was 1.042. The (maximum likelihood) rate obtained from step (3) above was very close to this: 1.043, meaning that for these data, the choice of a relatively stiff loess data smoothing parameter (span = 1.0) gave a very accurate representation of the underlying growth rate. Lastly, the estimate of R, the per capita disease transmission rate, thus ranges from 1.29 to 1.66 for d = 6 and 12 respectively, as computed from data over the last ~3 months. It will be interesting to see how these rates change with time, and how they compare to estimates from past epidemics, if they’ve been made, and also between the three countries in this outbreak.
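As a quick arithmetic check of the conversion R = b^d, using only the numbers quoted above (b = 1.043, d = 6 and 12 days); the variable names are illustrative:

```python
b = 1.043                    # best-fitting per-day growth factor
infectious_days = (6, 12)    # assumed range of the infectious period, d

r_estimates = {d: b ** d for d in infectious_days}
for d, r in r_estimates.items():
    print(f"d = {d:2d} days -> R = {r:.2f}")
```

The two values round to the 1.29 and 1.66 reported in the text.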

Modeling COVID-19

The global pandemic of COVID-19 has raised the profile of mathematical modeling, a core epidemiological approach to investigate the transmission dynamics of infectious diseases. Infectious disease modeling has been featured in routine briefings by the federal COVID task force, including projections of future COVID cases, hospitalizations, and deaths. Models have also been covered in the news, with stories on modeling research that has provided insight into the burden of disease in the United States and globally. Along with this coverage has also come interest in and criticism of modeling, including common sources of data inputs and structural assumptions.

In this post, I describe the basics of mathematical modeling, how it has been used to understand COVID-19, and its impact on public health decision making. This summarizes the material I discussed extensively in a recent invited talk on modeling for the COVID-19 global pandemic.

What Are Models?

Much of epidemiology (with many exceptions) is focused on the relationship between individual-level exposures (e.g., consumption of certain foods) and individual-level outcomes (e.g., incident cancers). Studying infectious diseases breaks many of these rules, due to the interest in quantifying not just disease acquisition but also disease transmission. Transmission involves understanding the effects of one’s exposures on the outcomes of other people. This happens because infectious diseases are contagious. Sir Ronald Ross, a British medical doctor and epidemiologist who characterized the transmission patterns of malaria in the early 20th century, called these “dependent happenings.”

Dependent happenings are driven by an epidemic feedback loop, whereby the individual risk of disease is a function of the current prevalence of disease. As prevalence increases, the probability of exposure to an infected person grows. And prevalence increases with incident infections, and this is driven by individual risk related to exposure.

These dependencies create non-linearities over time, as shown in the right panel above. At the beginning of an infectious disease outbreak, there is an exponential growth curve. This may be characterized based on the doubling time in cumulative case counts. Epidemic potential can also be quantified with R0, which is the average number of transmissions resulting from an infected individual in a completely susceptible population. The 0 in R0 refers to time 0 in an epidemic, when this would be the case; colloquially, people also use R0 to discuss epidemic potential at later time points. Therefore, R0 might shrink over time as the susceptible population is depleted, or as different behavioral or biological interventions are implemented.
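For pure exponential growth, doubling time and daily growth are two views of the same quantity. A small sketch, assuming counts multiply by a constant factor each day (the 10% daily growth figure is a made-up example, and `doubling_time` is an illustrative name):

```python
import math

def doubling_time(daily_growth_factor):
    """Days needed for counts to double when they multiply by a constant
    factor each day (only meaningful for factors greater than 1)."""
    return math.log(2) / math.log(daily_growth_factor)

factor = 1.10  # hypothetical: 10% more cumulative cases each day
t_double = doubling_time(factor)
print(f"doubling time: {t_double:.1f} days")
```

Raising the daily factor to the power of the doubling time returns 2, which is a convenient sanity check when reading doubling times off a case curve.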

Mathematical models for epidemics take parameters like R0 as inputs. Models then construct the mechanisms to get from the micro-level (individual-level biology, behavior, and demography) to the macro-level (population disease incidence and prevalence). This construction depends heavily on theory, often supported by multiple fields of empirical science that provide insight into how the mechanisms (gears in the diagram below) fit together in the system.

Because of the complexity of these systems, and the wide range of mechanisms embedded, models typically synthesize multiple data streams from interdisciplinary scientific fields. Flexibility with data inputs is also important during disease outbreaks, when the availability of large cohort studies or clinical trials to explain the disease etiology or interventions with precision may be limited.

Fortunately, there are several statistical methods for evaluating the consistency of the hypothesized model against nature. Model calibration methods test which model parameter values (e.g., values of R0) are more or less consistent with data (e.g., case surveillance of diagnosed cases). Sensitivity analyses quantify how much the final projections of a model (e.g., the effect of an infectious disease intervention) depend on the starting model inputs.

Putting these pieces together, models provide a virtual laboratory to test different hypotheses about the often complex and counterintuitive relationships between inputs and outputs. This virtual laboratory not only allows for estimation of projected future outcomes, but also testing of counterfactual scenarios for which complete data may not be available.

How Are Models Built and Analyzed?

There are many classes of mathematical models used within epidemiology. Three broad categories are: deterministic compartmental models (DCMs), agent-based models (ABMs), and network models. DCMs divide the population into groups defined, at a minimum, by the possible disease states that one could be in over time. ABMs and network models represent and simulate individuals rather than groups, and they provide a number of advantages in representing the contact processes that generate disease exposures. DCMs are the foundation of mathematical epidemiology, and provide a straightforward introduction to how models are built.

Take the example in the figure below of an immunizing disease like influenza or measles, which can be characterized by the disease states of susceptible (compartment S), infected (compartment I), and recovered (compartment R). Persons start out in S at birth, then move to I, and then to R. The flow diagram, kind of like a DAG, defines the types of transition that are hypothesized to be possible (and by an omission of arrows, which are hypothesized not). Movement from S to I corresponds to disease transmission, and the movement from I to R corresponds to recovery. There may be additional exogenous in-flows and out-flows, like those shown in the diagram, that correspond to births and deaths.

The speed at which transmission and recovery occur over time is controlled by model parameters. These flow diagrams are translated into mathematical equations that formally define this model structure and the model parameters. The equations that correspond to this figure are differential equations that specify, on the left-hand side, how fast the sizes of the compartments change (the numerators) over time (the denominator). On the right-hand side is the definition of the set of flows in and out of each compartment.

One flow, from the S to I compartment, includes the λ (lambda) parameter that defines the “force of infection.” This is the time-varying rate of disease transmission. It varies over time for the reasons shown in the epidemic feedback loop diagram, shown above, and formalized in the equation below. The rate of disease transmission per unit of time can be defined as the rate of contact per time, c, times the probability that each contact will lead to a transmission event, t, times the probability that any contact is with an infected person. The last term is another way of expressing the disease prevalence; this is the feature of the feedback loop that changes over time as the epidemic plays out.
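The original figures are not reproduced here, so the following is only a sketch of what such equations typically look like for an SIR model with births and deaths. The symbols N (total population), ρ (recovery rate), ν (birth rate) and μ (death rate) are assumptions, and the per-contact transmission probability, called t in the text, is written τ to avoid a clash with time:

```latex
\lambda = c \,\tau\, \frac{I}{N}, \qquad N = S + I + R

\frac{dS}{dt} = \nu N - \lambda S - \mu S

\frac{dI}{dt} = \lambda S - \rho I - \mu I

\frac{dR}{dt} = \rho I - \mu R
```

The I/N factor in λ is the prevalence term: it is the only part of the force of infection that changes as the epidemic unfolds, which is what makes the system non-linear.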

The overall size of transitions is therefore a function of these model parameters and the total size of the compartments that the parameters apply to. In the case of disease transmission, the parameters apply to people who could become infected, or people in the S compartment. Once all the equations are built, they are programmed in a computer, such as the software tool for modeling that I built called EpiModel. To experiment with a simple DCM model, check out our Shiny app.
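As a rough illustration of "programming the equations in a computer", here is a minimal SIR sketch in plain Python. It is not EpiModel: it assumes a closed population (no births or deaths), simple Euler time stepping, and made-up parameter values; the function name `simulate_sir` and its arguments are hypothetical.

```python
def simulate_sir(s0, i0, rec0, contact_rate, trans_prob, recovery_rate,
                 days, dt=0.1):
    """Euler integration of a closed-population SIR model.

    Returns a list of (time, S, I, R) tuples, one per time step.
    """
    s, i, r = float(s0), float(i0), float(rec0)
    history = [(0.0, s, i, r)]
    n_steps = int(round(days / dt))
    for step in range(1, n_steps + 1):
        n = s + i + r
        # Force of infection: contacts x transmission probability x prevalence
        force_of_infection = contact_rate * trans_prob * i / n
        new_infections = force_of_infection * s * dt
        new_recoveries = recovery_rate * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history

# Hypothetical inputs: 2 contacts/day, 20% transmission per contact,
# 10-day average infectious period, one initial case among 1000 people
history = simulate_sir(999, 1, 0, contact_rate=2.0, trans_prob=0.2,
                       recovery_rate=0.1, days=200)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"peak of about {peak_infected:.0f} infected around day {peak_time:.0f}")
```

Changing `contact_rate` is the kind of experiment such a model is used for: lowering it reduces the peak and pushes it later in time.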

More complex models build out the possible disease states, for example, by adding a latently infected but un-infectious stage (called SEIR models). Or they add another transition, by adding an arrow from R back to S in the case that immunity is temporary (called SIRS models). Or they add extra stratifications, such as age groups, when those strata are relevant to the disease transmission or recovery process. By adding these stratifications, different assumptions about the contact process are then possible, for example, by simulating a higher contact rate for younger persons or concentrating most of the contacts of young people with other young people. These additional model structures should be based on good theory, supported by empirical data.

How Have Models Been Used to Understand COVID-19?

Mathematical models have been used broadly in two ways in the current COVID-19 global pandemic: 1) understanding what has just happened to the world or what will soon happen, and 2) figuring out what to do about it.

In the first category, several models have estimated the burden of disease (cases, hospitalizations, deaths) against healthcare capacity. The most famous of these models is the “Imperial College” model, led by investigators at that institution, and published online on March 16. This is an agent-based model that first projected the numbers of deaths and hospitalizations of COVID in the U.K. and the U.S. against current critical care capacity under different scenarios. In the “do nothing” scenario, in which there were no changes to behavior, the model projected 2.2 million deaths would occur in the U.S. and over 500,000 in the U.K.

The model also included scenarios of large-scale behavioral change (an example of the second category of use, what to do about it), in which different case isolation and “social distancing” (a new addition to the lexicon) measures were imposed. Under these scenarios, we could potentially “flatten the curve,” which meant reducing the peak incidence of disease relative to the healthcare system capacity. These changes were implemented in the model by changing the model parameters related to the contact rates; in this case, the model structure and the contact rates were stratified by location of contacts (home, workplace, school, community) and age group.

After these models were released, the U.S. federal government substantially changed its recommendations related to social distancing nationally. There was subsequent discussion about how long these distancing measures needed to be implemented, because of the huge social and economic disruption that these changes entailed. One high-stakes policy question was whether these changes could be relaxed by Easter in mid-April or perhaps early Summer.

The Imperial College model suggests that as soon as the social distancing measures are relaxed (in the purple band) there will be a resurgence of new cases. This second wave of infection was driven by the fact that the outbreak would continue in the absence of any clinical therapy to either prevent the acquisition of disease (e.g., a vaccine) or reduce its severity (e.g., a therapeutic treatment). Particularly concerning with these incremental distancing policies would be if the second wave occurred during the winter months later this year, which would coincide with seasonal influenza.

An update to the Imperial College model was released on March 30. This model projected a much lower death toll in the U.K. (around 20,000 deaths, compared to over 500,000 in the earlier model). This was interpreted by some news reports as an error in the earlier model. But instead, this revised model incorporated the massive social changes that were implemented in the U.K. and other European countries over the month of March, as shown in the figure below. Adherence to these policies was estimated to have prevented nearly 60,000 deaths during March.

This is just one of many mathematical models for COVID. Several other examples of interest are included in the resource list below. There has been an explosion of modeling research on COVID since the initial outbreak in Wuhan, China in early January. This has been facilitated by the easy sharing of pre-print papers, along with the relatively low threshold in building simple epidemic models. With this explosion of research, much of the world has become interested in modeling research, as the model projections are very relevant to daily life and fill the gap in the news coverage in advance of clinical advances in testing, treatment, and vaccine technologies. Because pre-prints have not been formally vetted in peer review, it can be challenging for non-modelers (including news reporters and public health policymakers) to evaluate the quality of modeling projections. We have seen several cases already where nuanced modeling findings have been misinterpreted or overinterpreted in the news.

As the adage by George Box goes: all models are wrong, but some are useful. This applies to mathematical models for epidemics too, including those for COVID-19. Useful models are informed by good data, and this data collection usually takes time. These data inputs for models may rapidly change as well, as was the case for the updated Imperial College model, so earlier model projections may be outdated. This does not mean that the earlier model was wrong. In one sense, models prove their utility in the absence of bad news if they stimulate public action towards prevention, which may have an effect on the shape of the future epidemic curve. In the short-term, public consumers of models may not be able to fully determine the technical quality of that research. But it is important to understand that the priorities of newspapers and politicians, and what they find useful in some models, may differ substantially from strong scientific principles.


There are many resources for learning more about modeling, including my Spring Semester course at RSPH, EPI 570 (Infectious Disease Dynamics: Theory and Models). We use the textbook, An Introduction to Infectious Disease Modeling, by Emilia Vynnycky & Richard White, that provides an excellent overview of modeling basics. We also have open materials available for our summer workshop, Network Modeling for Epidemics, that focuses specifically on stochastic network models.

In addition, here is a short list of interesting and well-done COVID modeling studies:

  • Original Imperial College model:
  • Updated Imperial College model:
  • Model of social distancing in Wuhan:
  • Social distancing model for repeated episodic distancing measures:
  • Interactive model on the NY Times:
  • Age profile of the COVID epidemic:
  • Model of outbreak on the Diamond Princess cruise ship:

Samuel Jenness, PhD is an Assistant Professor in the Department of Epidemiology at the Rollins School of Public Health at Emory University. He is the Principal Investigator of the EpiModel Research Lab, where the research focuses on developing methods and software tools for modeling infectious diseases. Our primary applications are focused on understanding HIV and STI transmission in the United States and globally, as well as the intersection between infectious disease epidemiology and network science.


In the previous chapters, several models used in stock assessment were analysed, the respective parameters having been defined. In the corresponding exercises, it was not necessary to estimate the values of the parameters because they were given. In this chapter, several methods of estimating parameters will be analysed. In order to estimate the parameters, it is necessary to know the sampling theory and statistical inference.

This manual will use one of the general methods most commonly used in the estimation of parameters - the least squares method. In many cases this method uses iterative processes, which require the adoption of initial values. Therefore, particular methods will also be presented, which obtain estimates close to the real values of the parameters. In many situations, these initial estimates also have a practical interest. These methods will be illustrated with the estimation of the growth parameters and the S-R stock-recruitment relation.

The least squares method is presented under the forms of simple linear regression, the multiple linear model and non-linear models (Gauss-Newton method).

Subjects like residual analysis, the sampling distribution of the estimators (asymptotic or empirical, bootstrap and jackknife), confidence limits and intervals, etc., are important. However, these matters would need a more extensive course.


Consider the following variables and parameters:

Response or dependent variable = $Y$

Auxiliary or independent variable = $X$

Parameters = $A$, $B$

The response variable is linear with the parameters

$Y = A + BX$

The objective of the method is to estimate the parameters of the model, based on the observed pairs of values and applying a certain criterion function (the observed pairs of values are constituted by selected values of the auxiliary variable and by the corresponding observed values of the response variable), that is:

Observed values: $x_i$ and $y_i$ for each pair $i$, where $i = 1, 2, \ldots, n$

Values to be estimated: $A$ and $B$ and $(Y_1, Y_2, \ldots, Y_i, \ldots, Y_n)$ for the $n$ observed pairs of values

Object function (or criterion function):

$$\Phi = \sum_{i=1}^{n} (y_i - Y_i)^2 = \sum_{i=1}^{n} (y_i - A - B x_i)^2$$

In the least squares method the estimators are the values of A and B which minimize the object function. Thus, one has to calculate the derivatives ∂Φ/∂A and ∂Φ/∂B, equate them to zero and solve the system of equations in A and B.

The solution of the system can be presented as:

$$\hat{B} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{A} = \bar{y} - \hat{B}\,\bar{x}$$

Notice that the observed values y, for the same set of selected values of X, depend on the collected sample. For this reason, the problem of the simple linear regression is usually presented in the form:

$$y = A + BX + \varepsilon$$

where ε is a random variable with expected value equal to zero and variance equal to $\sigma^2$.

So, the expected value of y will be Y or A+BX and the variance of y will be equal to the variance of ε.

The terms deviation and residual will be used in the following ways:

Deviation is the difference between y observed and y mean ($\bar{y}$), i.e., deviation = $(y - \bar{y})$

Residual is the difference between y observed and Y estimated ($\hat{Y}$), i.e., residual = $(y - \hat{Y})$

To analyse the adjustment of the model to the observed data, it is necessary to consider the following characteristics:

Sum of squares of the residuals:

$$SQ_{residual} = \sum_{i=1}^{n} (y_i - \hat{Y}_i)^2$$

This quantity indicates the residual variation of the observed values in relation to the estimated values of the response variable of the model, which can be considered as the variation of the observed values that is not explained by the model.

Sum of squares of the deviations of the estimated values of the response variable of the model:

$$SQ_{model} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{y})^2$$

This quantity indicates the variation of the estimated values of the response variable of the model in relation to its mean, that is, the variation of the response variable explained by the model.

Total sum of squares of the deviations of the observed values:

$$SQ_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

This quantity indicates the total variation of the observed values in relation to the mean.

It is easy to verify the following relation:

$SQ_{total} = SQ_{model} + SQ_{residual}$

$r^2 = SQ_{model} / SQ_{total}$ (coefficient of determination) is the percentage of the total variation that is explained by the model and

$1 - r^2$ is the percentage of the total variation that is not explained by the model.
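The closed-form least squares estimators and the three sums of squares of the simple linear model can be checked numerically with a short sketch. The five data points are made up for illustration and `simple_linear_regression` is a hypothetical helper name:

```python
def simple_linear_regression(x, y):
    """Least squares estimates of A and B in Y = A + B*X, plus the sums of
    squares used to judge the fit."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = y_mean - b_hat * x_mean
    fitted = [a_hat + b_hat * xi for xi in x]
    sq_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sq_model = sum((fi - y_mean) ** 2 for fi in fitted)
    sq_total = sum((yi - y_mean) ** 2 for yi in y)
    return {"A": a_hat, "B": b_hat, "SQ_residual": sq_residual,
            "SQ_model": sq_model, "SQ_total": sq_total,
            "r2": sq_model / sq_total}

# Made-up observations lying close to the line y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_linear_regression(x, y)
print(fit)
```

For these values the estimates come out as B = 1.99 and A = 0.05, and the printed sums of squares satisfy the decomposition SQ total = SQ model + SQ residual.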


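Consider the following variables and parameters:

Response or dependent variable = $Y$

Auxiliary or independent variables = $X_1, X_2, \ldots, X_j, \ldots, X_k$

Parameters = $B_1, B_2, \ldots, B_j, \ldots, B_k$

The response variable is linear with the parameters

$Y = B_1 X_1 + B_2 X_2 + \ldots + B_k X_k = \sum_{j=1}^{k} B_j X_j$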

The objective of the method is to estimate the parameters of the model, based on the observed n sets of values and by applying a certain criterion function (the observed sets of values are constituted by selected values of the auxiliary variables and by the corresponding observed values of the response variable), that is:

Observed values: $x_{1,i}, x_{2,i}, \ldots, x_{j,i}, \ldots, x_{k,i}$ and $y_i$ for each set $i$, where $i = 1, 2, \ldots, n$

Values to be estimated: $B_1, B_2, \ldots, B_j, \ldots, B_k$ and $(Y_1, Y_2, \ldots, Y_i, \ldots, Y_n)$

The estimated values can be represented by: $\hat{B}_1, \ldots, \hat{B}_k$ (or $b_1, \ldots, b_k$) and $(\hat{Y}_1, \ldots, \hat{Y}_n)$

Object function (or criterion function):

$$\Phi = \sum_{i=1}^{n} (y_i - Y_i)^2$$

In the least squares method the estimators are the values of $B_j$ which minimize the object function.

As with the simple linear model, the procedure of minimization requires equating to zero the partial derivatives of Φ with respect to each parameter $B_j$, where $j = 1, 2, \ldots, k$. The system is preferably solved using matrix calculus.

Matrix X (n,k) = Matrix of the n observed values of each of the k auxiliary variables
Vector y (n,1) = Vector of the n observed values of the response variable
Vector Y (n,1) = Vector of the values of the response variable given by the model (unknown)
Vector B (k,1) = Vector of the parameters
Vector $\hat{B}$ or b (k,1) = Vector of the estimators of the parameters

To calculate the least squares estimators it will suffice to set the derivative dΦ/dB of Φ with respect to the vector B equal to zero. dΦ/dB is a vector with components $\partial\Phi/\partial B_1, \partial\Phi/\partial B_2, \ldots, \partial\Phi/\partial B_k$. Thus:

$$\frac{d\Phi}{dB}_{(k,1)} = -2\,X^T (y - XB) = 0$$

or $X^T y - (X^T X)\,B = 0$

and $b = \hat{B} = (X^T X)^{-1} X^T y$

The results can be written as:

$b_{(k,1)} = (X^T X)^{-1} X^T y$

$\hat{Y} = X b$ or $\hat{Y} = X (X^T X)^{-1} X^T y$

residuals$_{(n,1)}$ = $(y - \hat{Y})$
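The matrix solution above translates almost line for line into NumPy. A minimal sketch on made-up data, where the first auxiliary variable is assumed to be a column of ones so that $B_1$ acts as an intercept:

```python
import numpy as np

# Made-up observations: n = 6 sets of values, k = 3 auxiliary variables
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Matrix X (n,k); the column of ones is an assumption of this example
X = np.column_stack([np.ones_like(x1), x1, x2])

# b = (X^T X)^-1 X^T y, obtained by solving the normal equations
# (X^T X) b = X^T y rather than forming the inverse explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)

Y_hat = X @ b            # estimated values of the response variable
residuals = y - Y_hat    # residuals (n,1)
print("b =", b)
```

Solving the normal equations with `np.linalg.solve` gives the same `b` as the explicit inverse in the formula; for badly conditioned X a dedicated least squares routine such as `np.linalg.lstsq` is usually the safer choice.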

In statistical analysis it is convenient to write the estimators and the sums of the squares using idempotent matrices. The idempotent matrices $L$, $(I - L)$ and $(I - M)$ are then used, with $L_{(n,n)} = X (X^T X)^{-1} X^T$, $I$ = identity matrix and $M_{(n,n)}$ = mean matrix = $\frac{1}{n}[1]$, where $[1]$ is a matrix with all its elements equal to one.

It is also important to consider the sampling distributions of the estimators assuming that the variables ε i are independent and have a normal distribution.

A summary of the main properties of the expected value and variance of the estimators is presented:

Observed response variable y

Estimator of Y of the model

6.1 - Residual sum of squares = $SQ_{residual\,(1,1)} = (y - \hat{Y})^T (y - \hat{Y}) = y^T (I - L)\, y$

This quantity indicates the residual variation of the observed values in relation to the estimated values of the model, that is, the variation not explained by the model.

6.2 - Sum of squares of the deviation of the model = $SQ_{model\,(1,1)} = (\hat{Y} - \bar{y})^T (\hat{Y} - \bar{y}) = y^T (L - M)\, y$

This quantity indicates the variation of the estimated response values of the model in relation to the mean, that is, the variation explained by the model.

6.3 - Total sum of the squares of the deviations = $SQ_{total\,(1,1)} = (y - \bar{y})^T (y - \bar{y}) = y^T (I - M)\, y$

This quantity indicates the total variation of the observed values in relation to the mean.

It is easy to verify the following relation:

$SQ_{total} = SQ_{model} + SQ_{residual}$

or $1 = R^2 + (1 - R^2)$

$R^2$ is the percentage of the total variation that is explained by the model. In matrix terms it will be:

$R^2 = [y^T (L - M)\, y]\,[y^T (I - M)\, y]^{-1}$

$1 - R^2$ is the percentage of the total variation that is not explained by the model.

The ranks of the matrices (I-L), (I-M) and (L-M) respectively equal to (n-k), (n-1) and (k-1), are the degrees of freedom associated with the respective sums of squares.
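The idempotent-matrix expressions and their degrees of freedom can be verified numerically. In this sketch the data are made up and X is assumed to contain a column of ones (the decomposition and the rank k-1 of L-M rely on that). Because the rank of an idempotent matrix equals its trace, the trace is used here as a numerically robust way to obtain the ranks:

```python
import numpy as np

# Made-up data: n = 6 observations, k = 3 parameters (including intercept)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])
X = np.column_stack([np.ones_like(x1), x1, x2])
n, k = X.shape

I = np.eye(n)
L = X @ np.linalg.inv(X.T @ X) @ X.T   # L = X (X^T X)^-1 X^T
M = np.full((n, n), 1.0 / n)           # mean matrix, all elements 1/n

sq_residual = y @ (I - L) @ y
sq_model = y @ (L - M) @ y
sq_total = y @ (I - M) @ y
r_squared = sq_model / sq_total

# Degrees of freedom: trace of each idempotent matrix (equal to its rank)
dof = tuple(int(round(np.trace(A))) for A in (I - L, I - M, L - M))
print("R^2 =", round(r_squared, 4), "degrees of freedom =", dof)
```

The printed degrees of freedom are (n-k, n-1, k-1), here (3, 5, 2).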


Consider the following variables and parameters:

Response or dependent variable = $Y$

Auxiliary or independent variable = $X$

Parameters = $B_1, B_2, \ldots, B_j, \ldots, B_k$

The response variable is non-linear with the parameters

$Y = f(X; B)$ where $B$ is a vector with the components $B_1, B_2, \ldots, B_j, \ldots, B_k$

The objective of the method is to estimate the parameters of the model, based on the n observed pairs of values and by applying a certain criterion function (the observed pairs of values are constituted by selected values of the auxiliary variable and by the corresponding observed values of the response variable), that is:

Observed values: $x_i$ and $y_i$ for each pair $i$, where $i = 1, 2, \ldots, n$

Values to be estimated: $B_1, B_2, \ldots, B_j, \ldots, B_k$ and $(Y_1, Y_2, \ldots, Y_i, \ldots, Y_n)$ for the n pairs of observed values.

(Estimates = $\hat{B}$ or $b_1, b_2, \ldots, b_j, \ldots, b_k$ and $\hat{Y}$)

Object function or criterion function:

$$\Phi = \sum_{i=1}^{n} (y_i - Y_i)^2 = \sum_{i=1}^{n} \left(y_i - f(x_i; B)\right)^2$$

The estimators will be the values of $B_j$ for which the object function is minimum.

(This criterion is called the least squares method).

It is convenient to present the problem using matrices.

Vector X (n,1) = Vector of the observed values of the auxiliary variable
Vector y (n,1) = Vector of the observed values of the response variable
Vector Y (n,1) = Vector of the values of the response variable given by the model
Vector B (k,1) = Vector of the parameters
Vector b (k,1) = Vector of the estimators of the parameters

In the case of the non-linear model, it is not easy to solve the system of equations that results from setting the derivative of the function Φ with respect to the vector B equal to zero. Estimation by the least squares method can instead use iterative methods, based on the Taylor series expansion of the function Y.

Revision of the Taylor series expansion of a function

Here is an example of the expansion of a function in the Taylor series in the case of a function with one variable.

The Taylor approximation expands a function $Y = f(x)$ around a selected point, $x_0$, in a power series of $(x - x_0)$:

$$Y = f(x) = f(x_0) + (x - x_0)\frac{f'(x_0)}{1!} + (x - x_0)^2\frac{f''(x_0)}{2!} + \ldots + (x - x_0)^i\frac{f^{(i)}(x_0)}{i!} + \ldots$$

$f^{(i)}(x_0)$ = i-th derivative of f(x) with respect to x, at the point $x_0$.

The expansion can be truncated at the desired power. When it is truncated at power 1 it is called a linear approximation, that is,

$$Y \approx f(x_0) + (x - x_0)\, f'(x_0)$$

The Taylor expansion can be applied to functions with more than one variable. For example, for a function $Y = f(x_1, x_2)$ of two variables, the linear expansion would be:

$$Y \approx f(x_{1(0)}, x_{2(0)}) + (x_1 - x_{1(0)})\,\frac{\partial f}{\partial x_1}(x_{1(0)}, x_{2(0)}) + (x_2 - x_{2(0)})\,\frac{\partial f}{\partial x_2}(x_{1(0)}, x_{2(0)})$$

which may be written, in matrix notation, as

$$Y \approx Y_{(0)} + A_{(0)}\,(x - x_{(0)})$$

where $Y_{(0)}$ is the value of the function at the point $x_{(0)}$, with components $x_{1(0)}$ and $x_{2(0)}$, and $A_{(0)}$ is the matrix of derivatives whose elements are equal to the partial derivatives of $f(x_1, x_2)$ with respect to $x_1$ and $x_2$ at the point $(x_{1(0)}, x_{2(0)})$.

To estimate the parameters, the Taylor series expansion of function Y is made with respect to the parameters B and not to the vector X.

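For example, the linear expansion of $Y = f(x; B)$ in $B_1, B_2, \ldots, B_k$ would be:

$$Y = f(x; B) \approx f(x; B_{(0)}) + (B_1 - B_{1(0)})\,\frac{\partial f}{\partial B_1}(x; B_{(0)}) + (B_2 - B_{2(0)})\,\frac{\partial f}{\partial B_2}(x; B_{(0)}) + \ldots + (B_k - B_{k(0)})\,\frac{\partial f}{\partial B_k}(x; B_{(0)})$$

or, in matrix notation, it would be:

$$Y \approx Y_{(0)} + A_{(0)}\,\Delta B_{(0)}$$

where $A_{(0)}$ = matrix of order (n,k) of the partial derivatives of $f(x; B)$ with respect to the vector B at the point $B_{(0)}$, and $\Delta B_{(0)} = B - B_{(0)}$ = vector of order (k,1).

Then, the object function will be:

$$\Phi = (y - Y)^T (y - Y) = \left(y - Y_{(0)} - A_{(0)}\,\Delta B_{(0)}\right)^T \left(y - Y_{(0)} - A_{(0)}\,\Delta B_{(0)}\right)$$

To obtain the minimum of this function it is more convenient to differentiate Φ with respect to the vector ΔB than with respect to the vector B, and to set the derivative equal to zero. Thus:

$$\frac{d\Phi}{d\Delta B_{(0)}} = -2\,A_{(0)}^T \left(y - Y_{(0)} - A_{(0)}\,\Delta B_{(0)}\right) = 0$$

$$\Delta B_{(0)} = \left(A_{(0)}^T A_{(0)}\right)^{-1} A_{(0)}^T \left(y - Y_{(0)}\right)$$

If $\Delta B_{(0)}$ is "equal to zero" then the estimate of B is equal to $B_{(0)}$.

(In practice, when we say "equal to zero" in this process, we really mean smaller than a tolerance vector that one has to define beforehand).

If $\Delta B_{(0)}$ is not "equal to zero" then the vector $B_{(0)}$ will be replaced by:

$$B_{(1)} = B_{(0)} + \Delta B_{(0)}$$

And the process will be repeated, that is, there will be another iteration with $B_{(0)}$ replaced by $B_{(1)}$ (and $A_{(0)}$ replaced by $A_{(1)}$). The iterative process will go on until the convergence at the desired level of approximation is reached.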
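The iteration just described can be sketched in a few lines of NumPy. Everything specific here is an assumption made for illustration: the model is a Holling type III curve in the form y = a·x²/(b² + x²), echoing the opening question; the data are synthetic and noise-free; and a simple step-halving safeguard is added on top of the plain Gauss-Newton update, which is not part of the method as stated above.

```python
import numpy as np

def holling3(x, a, b):
    """Assumed Holling type III form: y = a x^2 / (b^2 + x^2)."""
    return a * x ** 2 / (b ** 2 + x ** 2)

def jacobian(x, a, b):
    """Matrix A (n,2) of partial derivatives with respect to a and b."""
    denom = b ** 2 + x ** 2
    d_a = x ** 2 / denom
    d_b = -2.0 * a * b * x ** 2 / denom ** 2
    return np.column_stack([d_a, d_b])

def gauss_newton(x, y, start, tol=1e-10, max_iter=100):
    B = np.asarray(start, dtype=float)
    for iteration in range(1, max_iter + 1):
        Y0 = holling3(x, *B)
        A = jacobian(x, *B)
        # Delta B = (A^T A)^-1 A^T (y - Y0)
        delta = np.linalg.solve(A.T @ A, A.T @ (y - Y0))
        # Safeguard (an addition): halve the step while it worsens the fit
        step, sse_old = 1.0, np.sum((y - Y0) ** 2)
        while (step > 1e-6 and
               np.sum((y - holling3(x, *(B + step * delta))) ** 2) > sse_old):
            step /= 2.0
        B = B + step * delta
        if np.max(np.abs(step * delta)) < tol:  # "equal to zero"
            return B, iteration
    return B, max_iter

# Synthetic, noise-free data generated with a = 100 and b = 20 (assumed)
x = np.arange(1.0, 61.0)
y = holling3(x, 100.0, 20.0)

# Start from the maximum of the curve for a and a rough guess for b
B_hat, n_iter = gauss_newton(x, y, start=[y.max(), 15.0])
print("estimates:", B_hat, "iterations:", n_iter)
```

On real, noisy data the same loop applies, but the choice of starting values matters more, and convergence is not guaranteed.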

1. It is not guaranteed that the process always converges. Sometimes it does not, some other times it is too slow (even for computers!) and some other times it converges to another limit!!

2. The above described method is the Gauss-Newton method which is the basis of many other methods. Some of those methods introduce modifications in order to obtain a faster convergence like the Marquardt method (1963), which is frequently used in fisheries research. Other methods use the second order Taylor expansion (Newton-Raphson method), looking for a better approximation. Some others, combine the two modifications.

3. These methods need the calculation of the derivatives of the functions. Some computer programs require the introduction of the mathematical expressions of the derivatives, while others use sub-routines with numerical approximations of the derivatives.

4. In fisheries research, there are methods to calculate the initial values of the parameters, for example in growth, mortality, selectivity or maturity analyses.

5. It is important to point out that the convergence of the iterative methods is faster and more likely to approach the true limit when the initial value of the vector B (0) is close to the real value.
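The iteration described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the function name `gauss_newton`, the two-parameter saturating model and the noise-free synthetic data are all invented for the example, and the correction ΔB is taken as the least squares solution of A·ΔB ≈ Y − f(x; B), which is the usual Gauss-Newton step.

```python
import numpy as np

def gauss_newton(f, jac, x, y, b0, tol=1e-8, max_iter=50):
    """Iterate B(m+1) = B(m) + delta, where delta is the least squares
    solution of A(m) . delta = y - f(x; B(m))."""
    b = np.asarray(b0, dtype=float)
    for _ in range(max_iter):
        resid = y - f(x, b)              # Y - f(x; B(m))
        A = jac(x, b)                    # (n, k) matrix of partial derivatives
        delta, *_ = np.linalg.lstsq(A, resid, rcond=None)
        b = b + delta
        if np.max(np.abs(delta)) < tol:  # delta "equal to zero"
            return b, True
    return b, False                      # not converged within max_iter

# Hypothetical model: y = B1 * (1 - exp(-B2 * x))
def model(x, b):
    return b[0] * (1.0 - np.exp(-b[1] * x))

def model_jac(x, b):
    # Columns: df/dB1 and df/dB2
    return np.column_stack([1.0 - np.exp(-b[1] * x),
                            b[0] * x * np.exp(-b[1] * x)])

x_obs = np.linspace(0.5, 10.0, 20)
b_true = np.array([100.0, 0.3])
y_obs = model(x_obs, b_true)             # synthetic, noise-free data

b_hat, converged = gauss_newton(model, model_jac, x_obs, y_obs, b0=[90.0, 0.4])
print(b_hat, converged)
```

As note 5 suggests, the starting vector here is deliberately close to the true values; a poor start can make the same loop diverge or stop at `max_iter`.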


The least squares method (non-linear regression) allows the estimation of the parameters K, L ∞ and t o of the individual growth equations.

The starting values of K, L ∞ and t 0 for the iterative process of estimation can be obtained by simple linear regression using the following methods:

Ford-Walford (1933-1946) and Gulland and Holt (1959) Methods

The Ford-Walford and Gulland and Holt expressions, which were presented in Section 3.4, are already in their linear form, allowing the estimation of K and L ∞ with methods of simple linear regression on observed L i and T i . The Gulland and Holt expression allows the estimation of K and L ∞ even when the intervals of time T i are not constant. In this case, it is convenient to re-write the expression as:

Stamatopoulos and Caddy Method (1989)

These authors also present a method to estimate K, L ∞ and t o (or L o ) using the simple linear regression. In this case the von Bertalanffy equation should be expressed as a linear relation of L t against e -Kt .

Consider n pairs of values t i , L i where t i is the age and L i the length of the individual i, where i = 1, 2, ..., n.

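The von Bertalanffy equation, in its general form is (as previously seen):

$L_t = L_\infty - (L_\infty - L_a)\,e^{-K(t - t_a)}$

The equation has the simple linear form, y = a + bx, where:

$y = L_t,\quad x = e^{-Kt},\quad a = L_\infty,\quad b = -(L_\infty - L_a)\,e^{K t_a}$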

If one takes L a = 0, then t a =t o , but, if one considers t a = 0, then L a = L o .

The parameters to estimate from a and b will be L ∞ , t o or L o .

The authors propose adopting an initial value K (0) of K, and estimating a (0) , b (0) and r 2 (0) by simple linear regression between y (= L t ) and x (= $e^{-K^{(0)} t}$). The procedure may be repeated for several values of K, that is, K (1) , K (2) , ... One can then adopt the regression that results in the largest value of r 2 , to which K max , a max and b max correspond. From the values of a max , b max and K max one can obtain the values of the remaining parameters.

One practical process towards finding K max can be:

(i). To select two extreme values of K which include the required value, for example K= 0 and K=2 (for practical difficulties, use K = 0.00001 instead of K = 0).

(ii). Calculate the 10 regressions for 10 equally spaced values of K between those two extremes.

(iii). The corresponding 10 values of r 2 will allow one to select two new values of K which determine another interval, smaller than the one in (i), containing another maximum value of r 2 .

(iv). The steps (ii) and (iii) can be repeated until an interval of values of K with the desired approximation is obtained. Generally, the steps do not need many repetitions.
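Steps (i) to (iv) can be sketched in Python as a repeated grid search. This is an illustrative sketch, not the authors' program: the function names, the number of refinement rounds and the synthetic age-length data are assumptions, and the example takes t a = 0 so that b = −L ∞ ·e^(K·t 0 ).

```python
import numpy as np

def r_squared(K, t, L):
    """r^2 of the simple linear regression of L on x = exp(-K t)."""
    x = np.exp(-K * t)
    return np.corrcoef(x, L)[0, 1] ** 2

def search_K(t, L, K_low=1e-5, K_high=2.0, n_grid=10, n_rounds=8):
    """Refine the interval around the K with the largest r^2."""
    for _ in range(n_rounds):
        grid = np.linspace(K_low, K_high, n_grid)      # step (ii)
        r2 = np.array([r_squared(K, t, L) for K in grid])
        best = int(np.argmax(r2))
        K_low = grid[max(best - 1, 0)]                 # step (iii)
        K_high = grid[min(best + 1, n_grid - 1)]
    return grid[best]

# Synthetic ages and lengths from L_t = L_inf * (1 - exp(-K (t - t0)))
t = np.arange(1.0, 11.0)
L = 50.0 * (1.0 - np.exp(-0.3 * (t + 0.5)))            # L_inf=50, K=0.3, t0=-0.5

K_max = search_K(t, L)
b_max, a_max = np.polyfit(np.exp(-K_max * t), L, 1)    # slope b, intercept a
L_inf_hat = a_max                                      # a = L_inf
t0_hat = np.log(-b_max / a_max) / K_max                # from b = -L_inf * exp(K t0)
print(K_max, L_inf_hat, t0_hat)
```

With real, noisy data the maximum of r 2 is flatter, so fewer refinement rounds are usually meaningful.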


Several methods were proposed to estimate M, and they are based on the association of M with other biological parameters of the resource. These methods can produce approximate results.


Longevity : Maximum mean age t λ of the individuals in a non-exploited population.

Duration of the exploitable life : t λ - t r = λ (Figure 7.1)

Figure 7.1 Duration of the exploitable life

Tanaka (1960) proposes "NATURAL" Survival Curves (Figure 7.2) to obtain the values of M from longevity.

A cohort practically vanishes when only a fraction, p, of the recruited individuals survives. In that case, $N_\lambda = R\,e^{-M\lambda}$, and it can be written:

$p = \dfrac{N_\lambda}{R} = e^{-M\lambda}$, that is, $M = -\dfrac{\ln p}{\lambda}$

Different values of the survival fraction produce different survival curves of M in function of λ.

Figure 7.2 Survival curves by Tanaka

Any value of p can be chosen as the parameter that distinguishes the survival curves, for instance p = 5% (i.e. one in every twenty recruits survives until the age t λ ).
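As a small numerical illustration, the relation N λ = R·e^(−M·λ) with survival fraction p = N λ /R gives M = −ln(p)/λ. The sketch below evaluates it for a few durations of exploitable life; the function name `tanaka_M` and the chosen values of λ are assumptions made for the example.

```python
import math

def tanaka_M(lam, p=0.05):
    """Natural mortality M such that a fraction p survives after lam years."""
    return -math.log(p) / lam

for lam in (5, 10, 20):
    print(lam, round(tanaka_M(lam), 3))
```

A longer exploitable life implies a smaller M for the same survival fraction, which is what the family of survival curves expresses.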


Beverton and Holt Method (1959)

Gulland (1969) mentions that Beverton and Holt verified that species with a larger mortality rate M also presented larger values of K. Looking for a simple relation between these two parameters, they concluded approximately that:

Based on the following considerations:

1. Resources with a high mortality rate cannot have a very big maximum size
2. In warmer waters, the metabolism is accelerated, so the individuals can grow up to a larger size and reach the maximum size faster than in colder waters.

Based on data of 175 species, Pauly adjusted multiple linear regressions of transformed values of M against the corresponding transformed values of K, L ∞ and temperature, T, and selected one that was considered to have a better adjustment, that is, the following empirical relation:

with the parameters expressed in the following units:

M = year -1
L ∞ = cm of total length
K = year -1
T° = surface temperature of the waters in °C

Pauly highlights the application of this expression to small pelagic fishes and crustaceans. Pauly's original relation uses decimal logarithms, so its first coefficient differs from the value -0.0152 given in the previous expression, which is written with natural logarithms.
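A hedged sketch of how the empirical relation might be evaluated is given below. Only the intercept −0.0152 and the use of natural logarithms are confirmed by the text above; the other three coefficients are the values commonly quoted for Pauly (1980) and are an assumption here, as are the function name and the example inputs.

```python
import math

def pauly_M(L_inf_cm, K_per_year, T_celsius):
    """Natural mortality (year^-1) from Pauly's empirical relation,
    natural-log form. Coefficients other than -0.0152 are assumed
    (commonly quoted values), not taken from the text."""
    ln_M = (-0.0152
            - 0.279 * math.log(L_inf_cm)
            + 0.6543 * math.log(K_per_year)
            + 0.463 * math.log(T_celsius))
    return math.exp(ln_M)

# Hypothetical stock: L_inf = 50 cm, K = 0.3 year^-1, T = 20 degrees C
print(round(pauly_M(50.0, 0.3, 20.0), 3))
```

Dividing −0.0152 by ln 10 gives about −0.0066, which is the intercept that appears when the same relation is written with decimal logarithms.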


Rikhter and Efanov Method (1976)

These authors analysed the dependency between M and the age of first (or 50 percent) maturity. They used data from short-lived, medium-lived and long-lived species, and suggested the following relation between M and the age of first maturity, t mat :

Based on the assumption that the natural mortality rate should be related to the investment of the fish in reproduction, beyond the influence of other factors, Gundersson established several relations between M and those factors.

He proposed, however, the following simple empirical relation, using the Gonadosomatic Index (GSI) (estimated for mature females in the spawning period) in order to calculate M:


The natural mortality coefficients M i , at age i can be calculated from the catch, C i , in numbers, and the survival numbers, N i and N i+1 at the beginning and end of a year, by following the steps:

The several values of M obtained in each age could be combined to calculate a constant value, M, for all ages.

Let us consider the supposition that F i is proportional to f i for several years i, that is

So, the linear regression between Z i and f i has a slope b = q and an intercept a = M.


There are several methods of estimating the total mortality coefficient, Z, assumed to be constant during a certain interval of ages or years.

It is convenient to group the methods, according to the basic data, into those using ages or those using lengths.


The different methods are based on the general expression of the number of survivors of a cohort, at the instant t, submitted to the total mortality, Z, during an interval of time, that is:

Z is supposed to be constant in the interval of time (t a ,t b ).

Taking logarithms and re-arranging the terms, the expression will be:

where Cte is a constant (= ln N a +Zt a ).

This expression shows that the logarithm of the number of survivors is linear with the age, being the slope equal to -Z.

Any constant expression which does not affect the determination of Z will be referred to as Cte.

1. If Z can be considered constant inside the interval (t a ,t b ) and, having available abundance data, N i, or indices of abundance in number, U i in several ages, i, then, the application of the simple linear regression allows one to estimate the total mortality coefficient Z.

The simple linear regression between ln N i and t i allows the estimation of Z (notice that the constant, Cte, is different from the previous one; in this case only the slope matters to estimate Z).

2. If ages are not at constant intervals, the expression can be approximated and expressed in terms of the central age of each interval, t central,i . For variable T i , it will be:

$\ln N_i \approx Cte - Z\,t_{central,i}$

3. When using indices U i , the situation is similar because U i = q. N i , with q constant, and then, also:

The simple linear regression between ln U i and t i allows one to estimate Z.

4. If the intervals are not constant, the expression should be modified to:

Simple linear regression can be applied to obtain Z, from catches, C i , and ages, t i, supposing that F i is constant.

and so, when T i is constant. So:

5. If the intervals are not constant, the expression should be modified to:

6. Let V i be the cumulative catch from t i until the end of the life, then:

Where the sum goes from the last age until age i,

As F k and Z k are supposed to be constant ΣN kcum = N i /Z and so:

7. Following Beverton and Holt (1956), Z can be expressed as:

Then, it is possible to estimate Z from the mean age t

This expression was derived, considering the interval (t a , t b ) as (t a , ∞).
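The regression idea behind items 1 to 3 can be sketched as follows: since ln N i = Cte − Z·t i , the slope of ln N i against t i is −Z, and an abundance index U i = q·N i only changes the constant. The function name and the synthetic numbers below are assumptions made for the illustration.

```python
import numpy as np

def catch_curve_Z(ages, abundance):
    """Estimate Z as minus the slope of ln(abundance) against age."""
    slope, intercept = np.polyfit(ages, np.log(abundance), 1)
    return -slope, intercept

ages = np.arange(2.0, 9.0)                 # ages 2..8, hypothetical
N = 5000.0 * np.exp(-0.6 * ages)           # survivors with Z = 0.6
U = 0.001 * N                              # index of abundance, q = 0.001

Z_from_N, cte_N = catch_curve_Z(ages, N)
Z_from_U, cte_U = catch_curve_Z(ages, U)
print(Z_from_N, Z_from_U)                  # same slope, different constants
```

With real data it is worth plotting ln N i against t i first, to choose the interval of ages over which a constant Z is plausible.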


When one has available data by length classes instead of by age, the methods previously referred to can still be applied. For that purpose, it is convenient to define the relative age.

Using the von Bertalanffy equation one can obtain the age t in function of the length, as:

(the expression is written in the general form in relation to t a and not to t 0 )

(This equation is referred to by some authors as the inverse von Bertalanffy equation).

The difference t - t a is called the relative age, t * .

So: $t^* = -\dfrac{1}{K}\ln\dfrac{L_\infty - L_t}{L_\infty - L_a}$ or $t^* = -\dfrac{1}{K}\ln\left[1 - \dfrac{L_t - L_a}{L_\infty - L_a}\right]$

t * is called a relative age because the absolute ages, t, are related to a constant age, t a .

In this way, the duration of the interval T i can either be calculated by the difference of the absolute ages or by the difference of the relative ages at the extremes of the interval:

$T_i = t_{i+1} - t_i = t^*_{i+1} - t^*_i$
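A short sketch of this calculation, assuming hypothetical growth parameters and length class limits (none of these numbers come from the text), computes the relative age at each class limit and the interval durations T i as differences of relative ages:

```python
import numpy as np

def relative_age(L, L_inf, K, L_a):
    """Relative age t* = -(1/K) ln[(L_inf - L) / (L_inf - L_a)]."""
    L = np.asarray(L, dtype=float)
    return -(1.0 / K) * np.log((L_inf - L) / (L_inf - L_a))

# Hypothetical values: L_inf = 100 cm, K = 0.2 year^-1, L_a = 10 cm
limits = [10.0, 20.0, 30.0, 40.0, 50.0]      # lower limits of length classes
t_star = relative_age(limits, 100.0, 0.2, 10.0)
T = np.diff(t_star)                          # T_i = t*_{i+1} - t*_i
print(t_star)
print(T)
```

Equal-width length classes give increasing T i , because growth slows down as the length approaches L ∞ .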

So, the previous expressions still hold when the absolute ages are replaced by the relative ages:

$\ln N_i = Cte - Z\,t^*_{central,i}$
$\ln U_i = Cte - Z\,t^*_{central,i}$
$\ln V_i = Cte - Z\,t^*_i$
$\ln (C_i / T_i) = Cte - Z\,t^*_{central,i}$

Finally, the expression would also be:

Beverton and Holt (1957) proved that:

must be calculated as the mean of the lengths weighted with abundances (or their indices) or with the catches in numbers.

1. The application of any of these methods must be preceded by the graphical representation of the corresponding data, in order to verify whether the assumptions of the methods are acceptable and also to determine the adequate interval, (t a , t b ).

2. These formulas are proved with the indications that were presented, but it is a good exercise to develop the demonstrations as they clarify the methods.

3. It is useful to estimate a constant Z even when the assumption of constancy is not acceptable, because it gives a general indication of the magnitude of the values one can expect.

4. The methods are sometimes referred to by the names of the authors. For example, the expression ln V i = Cte - Z.t * i is called the Jones and van Zalinge method (1981).

5. The mean age as well as the mean length in the catch can be calculated from the following expressions:

with C i = catch in number in the age class i

where C i = catch in number in the length class i

with C i = catch in number in the age class.

The relative age should be $t^* = -\dfrac{1}{K}\ln\dfrac{L_\infty - L_t}{L_\infty - L_a}$

Summary of the Methods to Estimate the Total Mortality Coefficient, Z

Assumption: Z is constant in the interval of ages, (t a , t b )

(t b = ∞) (Beverton and Holt equation of Z )

Supposition: Z is constant in the interval of lengths, (L a , L b )

(Gulland and Holt equation)

(Jones and van Zalinge equation)

(Beverton and Holt equation of Z )


The least squares method (non-linear model) can be used to estimate the parameters, α and k, of any of the S-R models.

The initial values of the Beverton and Holt model (1957) can be obtained by re-writing the equation as:

$\dfrac{S}{R} = \dfrac{1}{\alpha} + \dfrac{1}{\alpha k}\,S$

and estimating the simple linear regression between y (= S/R) and x (=S) which will give the estimations of 1/α and 1/(αk). From these values, it will then be possible to estimate the parameters α and k. These values can be considered as the initial values in the application of the non-linear model.

In the Ricker model (1954) the parameters can be obtained by re-writing the equation as:

$\ln\dfrac{R}{S} = \ln\alpha - \dfrac{1}{k}\,S$

and applying the simple linear regression between y (= ln(R/S)) and x (= S) to estimate ln α and (-1/k). From these values, it will be possible to estimate the parameters (α and k) of the model, which can be considered as the initial values in the application of the non-linear model.

It is useful to plot y against x in order to verify that the plotted points lie approximately on a straight line before applying the linear regression in any of these models.
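The two-stage procedure for the Beverton and Holt model can be sketched with NumPy and SciPy. The model is written here as R = αS/(1 + S/k), which is the form implied by the intercept 1/α and slope 1/(αk) mentioned above; the stock and recruitment values are synthetic, with a small deterministic perturbation standing in for observation error.

```python
import numpy as np
from scipy.optimize import curve_fit

def beverton_holt(S, alpha, k):
    """Beverton and Holt S-R model, R = alpha * S / (1 + S / k)."""
    return alpha * S / (1.0 + S / k)

# Synthetic data (alpha = 5, k = 1000) with a 1% deterministic perturbation
S = np.linspace(100.0, 3000.0, 15)
R = beverton_holt(S, 5.0, 1000.0) * (1.0 + 0.01 * np.cos(2.0 * np.arange(S.size)))

# Stage 1: linear regression of y = S/R on x = S gives 1/alpha and 1/(alpha k)
slope, intercept = np.polyfit(S, S / R, 1)
alpha0 = 1.0 / intercept
k0 = intercept / slope                      # k = (1/alpha) / (1/(alpha k))

# Stage 2: non-linear least squares started from the linearised estimates
(alpha_hat, k_hat), pcov = curve_fit(beverton_holt, S, R, p0=[alpha0, k0])
print(alpha0, k0)
print(alpha_hat, k_hat)
```

The linearised estimates weight the observations differently from the non-linear fit, so the two stages generally give slightly different values; the first stage is only meant to provide a sensible starting point.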

In the models with the flexible parameter, c, like for example, the Deriso model (1980), the equation can be re-written as:

For a given value of c the linear regression between y (= $(R/S)^c$) and x (= S) allows the estimation of the parameters α and k.

One can try several values of c, for example values between -1 and 1, to verify which one gives the best fit of the line of y against x.

The values thus obtained for α, k and c, can be considered as initial values in the application of the iterative method, to estimate the parameters α, k and c of the non-linear Deriso model.



The cohort analysis is a method to estimate the fishing mortality coefficients, F i , and the number of survivors, N i , at the beginning of each age, from the annual structures of the stock catches, in number, over a period of years.

More specifically, consider a stock where the following is known:

age, i, where i = 1, 2, ..., k
year, j, where j = 1, 2, ..., n
Matrix of catches [C] with
C i,j = Annual catch, in number, of the individuals with the age i and during the year j
Matrix of natural mortality [M] with
M i,j = natural mortality coefficient, at the age i and in the year j.
Vector [T] where
T i = Size of the age interval i (in general, T i =T=1 year)

In the resolution of this problem, it is convenient to consider these estimations separately: for one age interval i (part 1), for all the ages during the life of a cohort (part 2) and, finally, for all the ages and years (part 3).

Consider that the following characteristics of a cohort, in an interval T i are known:

C i = Catch in number
M i = Natural mortality coefficient
T i = Size of the interval

Adopting a value of F i , it is then possible to estimate the number of survivors at the beginning, N i , and at the end, N i+1, of the interval.

In fact, from the expression:

one can calculate N i which is the only unknown variable in the expression.

To calculate N i+1 one can use the expression where the values N i , F i and M i were previously obtained.

Suppose now that the catches C i of each age i, of a cohort during its life, the values of M i and the sizes of the interval T i are known.

Adopting a certain value, F final, for the Fishing Mortality Coefficient in the last class of ages, it is possible, as mentioned in part 1, to estimate all the parameters (related to numbers) in that last age group . In this way, one will know the number of survivors at the beginning and end of the last age.

The number at the beginning of that last age class is also the number at the end of the previous class, that is, it is the final number of survivors of the class before last.

Using the C i expression, resulting from the combination of the two expressions above:

one can estimate F i in the previous class, which is the only unknown variable in the expression. The estimation may require iterative methods or trial and error methods .

Finally, to estimate the number N i of survivors at the beginning of the class i, the following expression can be used:

Repeating this process for all previous classes, one will successively obtain the parameters in all ages, until the first age.

In the case of a completely caught cohort, the number at the end of the last class is zero and the catch C has to be expressed as:

Pope (1972) presented a simple method to estimate the number of survivors at the beginning of each age of the cohort life, starting from the last age.

It is enough to apply successively in a backward way, the expression:

Pope indicates that the approximation is good when MT ≤ 0.6

Pope’s expression is obtained, supposing that the catch is made exactly at the central point of the interval T i (Figure 7.3).

Figure 7.3 Number of survivors during the interval T i = t i+1 - t i with the catch extracted at the central point of the interval

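Proceeding from the end to the beginning one calculates successively:

$N'' = N_{i+1}\,e^{+MT_i/2}$, $N' = N'' + C_i$, $N_i = N'\,e^{+MT_i/2}$

substituting N' by N'' + C i , the expression will be:

$N_i = (N'' + C_i)\,e^{+MT_i/2}$

Finally, substituting N'' by N i+1 ·e +MTi/2 , it will be:

$N_i = N_{i+1}\,e^{+MT_i} + C_i\,e^{+MT_i/2}$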
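Pope's backward calculation for a single cohort can be sketched in Python. The midpoint-catch argument gives N i = N i+1 ·e^(MT) + C i ·e^(MT/2); the catch equation C = (F/Z)·N·(1 − e^(−ZT)) used to start the recursion from an adopted F terminal is the standard form and is an assumption here, as are the function names and the catch figures.

```python
import numpy as np

def n_from_catch(C, F, M, T=1.0):
    """Invert the catch equation C = (F/Z) N (1 - exp(-Z T)) for N."""
    Z = F + M
    return C * Z / (F * (1.0 - np.exp(-Z * T)))

def pope_cohort(catches, M, F_terminal, T=1.0):
    """Survivors at the start of each age, working back from the last age."""
    C = np.asarray(catches, dtype=float)
    n = C.size
    N = np.empty(n)
    N[-1] = n_from_catch(C[-1], F_terminal, M, T)      # last age from F terminal
    for i in range(n - 2, -1, -1):
        N[i] = N[i + 1] * np.exp(M * T) + C[i] * np.exp(M * T / 2.0)
    F = np.full(n, F_terminal, dtype=float)
    F[:-1] = np.log(N[:-1] / N[1:]) / T - M            # from N_{i+1} = N_i exp(-Z_i T)
    return N, F

# Hypothetical catches in number at five consecutive ages, M = 0.2, T = 1 year
catches = [300.0, 250.0, 180.0, 100.0, 40.0]
N, F = pope_cohort(catches, M=0.2, F_terminal=0.5)
print(np.round(N, 1))
print(np.round(F, 3))
```

Here MT = 0.2, which is inside the range MT ≤ 0.6 where Pope indicates the approximation is good.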

Let us suppose now that the Catch matrix [C], the natural mortality [M] matrix and the vector size of the intervals [T], are known for a period of years.

Let us also assume that the values of F in the last age of all the years represented in the matrices and the values of F of all the ages of the last year were adopted. These values will be designated by F terminal (Figure 7.4)

Figure 7.4 Matrix of catch, [C], with F terminal in the last line and in the last column of the matrix C. The shadowed zones exemplify the catches of a cohort

Notice that in this matrix the elements of the diagonal correspond to values of the same cohort, because one element of a certain age and a certain year will be followed, in the diagonal, by the element that is a year older.

From parts 1 and 2 it will then be possible to estimate successively Fs and Ns for all the cohorts present in the catch matrix.

1. The values of M i,j are considered constant and equal to M, when there is no information to adopt other values.

2. When data is referred to ages, the values T i will be equal to 1 year.

3. The last age group of each year is sometimes a group of pooled ages, the group (+). The corresponding catches are composed of individuals of several ages caught during those years. So, the cumulative values do not belong to the same cohort, but are survivors of several previous cohorts with different recruitments and submitted to different fishing patterns. It would not be appropriate to apply cohort analysis to the catch of a group (+). Despite this fact, the group (+) is important in order to calculate the annual totals of the catches in weight, Y, of the total biomass, B, and of the spawning stock biomass, SP. So, it is usual to start the cohort analysis at the age immediately before the group (+) and to use the group (+) only to calculate the annual values of Y, B and SP. The value of F in the group (+) in each year can be taken as equal to the fishing mortality coefficient of the previous age or, in some cases, as a reasonable value in relation to the values of F i in the year being considered.

4. A difficulty in the technical application appears when the number of ages is small or when the years are few. In fact, in those cases, the cohorts have few age classes represented in the Matrix [C] and the estimations will be very dependent on the adopted values of F terminals .

5. The cohort analysis (CA) has also been designated as: VPA (Virtual Population Analysis), Derzhavin method, Murphy method, Gulland method, Pope method, Sequential Analysis, etc. Sometimes the name CA is used when the Pope formula is applied, and VPA in the other cases. Megrey (1989) presents a very complete review of cohort analysis.

6. It is also possible to estimate the remaining parameters in an age i , related to numbers, that is, N cumi , N i , D i , Z i and E i . When the matrices of initial individual weights, [w], or of mean weights, [w̄], are available, one can also calculate the matrices of annual catch in weight, [Y], of biomasses at the beginning of the years, [B], and of mean biomasses during the years, [B̄]. If one has information on maturity ogives in each year, for example at the beginning of the year, spawning biomasses [SP] can also be calculated. Usually, only the total catches Y, the stock biomasses (total and spawning) at the beginning of each year and the mean biomasses of the stock (total and spawning) in each year are estimated.

7. The elements on the first line of the matrix [N] can be considered estimates of the recruitment to the fishery in each year.

8. The fact that the F terminals are adopted and that these values have influence on the resulting matrix [F] and matrix [N], forces the selection of values of F terminals to be near the real ones. The agreement between the estimations of the parameters mentioned in the points 6. and 7. and other independent data or indices (for example, estimations by acoustic methods of recruitment or biomasses, estimations of abundance indices or cpue´s, of fishing efforts, etc) must be analysed.

9. The hypothesis that the exploitation pattern is constant from year to year means that the fishing level and the exploitation pattern can be separated, or Fsep ij = F j × s i . This hypothesis can be tested based on the matrix [F] obtained from the cohort analysis.

It is usual to call this separation VPA-Separable (SVPA).

Then, if F ij = F j .s i one can prove that .

If the estimated values of F ij are the same as the previous Fsep ij = F j .s i then the hypothesis is verified. This comparison can be carried out in two different ways, the simplest is to calculate the quotients Fsep ij /F ij . If the hypothesis is true this quotient is equal to one. If the hypothesis is not verified it is always possible to consider other hypotheses with the annual vector [s] constant in some years only, mainly the last years.

10. It is usual to consider an interval of ages, where it can be assumed that the individuals caught are "completely recruited". In that case, the interval of ages corresponds to exploitation pattern constant (for the remaining ages, not completely recruited, the exploitation pattern should be smaller). For that interval of ages, the means of the values of F i,j in each year are then calculated. Those means, F j , are considered as fishing levels in the respective years. The exploitation pattern in each cell, would then be the ratio F i,j / F j. The s i, for the period of years considered, can be taken as the mean of the relative pattern of exploitation calculated before. Alternatively, they can also be taken as referring to s i of an age chosen for reference.


The technique of cohort analysis, applied to the structure of the catches of a cohort during its life, can be used with non-constant intervals of time, T i . This means that the length class structure of the catches of a cohort during its life can also be analysed.

The method of cohort analysis in those cases is called LCA (Length Cohort Analysis). The same techniques used in the CA for ages (Pope method, iterative method, etc.) can be applied in the LCA (the intervals T i can be calculated from the relative ages).

One way to apply the LCA to the length annual catch compositions, will be: to group the catches of length classes belonging to the same age interval in each year. The technique CA can then be applied directly to the resulting age composition of the catches by age of the matrix [C]. This technique is known as "slicing" the length compositions. To "slice", one usually inverts the von Bertalanffy length growth equation and estimates the age t i for each length L i (sometimes using the relative ages t * i ) (Figure 7.5). It is possible that when grouping the length classes of the respective age interval, there are length classes composed by elements that belong to two consecutive age groups. In these cases, it will be necessary to "break" the catch of these extreme classes into two parts and distribute them to each of those ages. In the example of Figure 7.5, the catches of the length class (24-26] belong to age 0 and to age 1. So, it is necessary to distribute that catch to the two ages. One simple method is to attribute to age 0 the fraction (1.00 - 0.98)/(1.06 - 0.98) = 0.25 of the annual catch of that length class and to age 1 the fraction (1.06 - 1.00)/(1.06 - 0.98) = 0.75. The method may not be the most appropriate one, because it is based on the assumption that, in the length classes, the distribution of the individuals by length is uniform. So, it is necessary to use the smallest possible interval of length classes, when applying this distribution technique.
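The proportional split described for the length class (24-26] can be written as a small helper. The ages 0.98 and 1.06 at the class limits and the age boundary 1.00 are taken from the example above; the function name and the catch of 1000 individuals are hypothetical.

```python
def split_catch(catch, t_low, t_high, t_boundary):
    """Split the catch of a length class between two consecutive ages,
    assuming lengths are uniformly distributed within the class."""
    frac_younger = (t_boundary - t_low) / (t_high - t_low)
    return catch * frac_younger, catch * (1.0 - frac_younger)

# Length class (24-26]: ages 0.98 and 1.06 at its limits, boundary at age 1.00
to_age_0, to_age_1 = split_catch(1000.0, 0.98, 1.06, 1.00)
print(to_age_0, to_age_1)
```

The uniformity assumption is the weak point of the method, which is why narrow length classes are recommended.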

Another way to do the length cohort analysis is to use the catches in the length classes of the same age group. It is possible to follow the cohorts in the matrix [C], through the length classes belonging to a same age, in a certain year, with the length classes of the next age, in the following year, etc. In this way, the different cohorts existing in the matrix will be separated and the evolution of each one of them will be by length classes, not by age (see Figure 7.5).

Figure 7.5 Example of a matrix [C] with the catches of the cohort recruited in year 2000 shaded and written in bold, "sliced" by length classes

Design Ground Motions

Engineers should typically use the tools below for seismic design; the parameter values they provide are not typically identical to those from hazard tools available elsewhere on the USGS website.

The USGS collaborates with organizations that develop building codes (for buildings, bridges, and other structures) to make seismic design parameter values available to engineers. The design code developers first decide how USGS earthquake hazard information should be applied in design practice. Then, the USGS calculates values of seismic design parameters based on USGS hazard values and in accordance with design code procedures.

U.S. Seismic Design Maps Web Services

Due to insufficient resources and the recent development of similar web tools by third parties, the USGS has replaced its former U.S. Seismic Design Maps web applications with web services that can be used through third-party tools. Your options for using the replacement USGS web services, which still provide seismic design parameter values from numerous design code editions, are:

Third-party Graphical User Interfaces (GUIs)

Most users obtain seismic design parameter values from the USGS web services through third-party GUIs like the following:

USGS U.S. Seismic Design Maps Web Services

It is possible, but less convenient, to obtain seismic design parameter values directly from the USGS web services:

Risk-Targeted Ground Motion Calculator

This web tool calculates risk-targeted ground motion values from probabilistic seismic hazard curves in accordance with the site-specific ground motion procedures defined in “Method 2” (Section of the ASCE/SEI 7-10 and 7-16 standards. The vast majority of engineering projects in the U.S. will require use of the U.S. Seismic Design Maps Web Services (see above) rather than this Risk-Targeted Ground Motion Calculator.

Seismic Design Data Sets (Text file format)

Seismic Design Maps (PDF format)

Seismic design maps from various design code reference documents are available below as PDF files.

It seems like there are three questions here:

Is the actual distribution of cases Gaussian? No.

Are the curves given in the graphic Gaussian? Not quite. I think the red one is a little bit skewed, and the blue one is definitely skewed.

Can plots of a value versus time be considered Gaussian? Yes.

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the form $f(x) = a e^{-\frac{(x-b)^2}{2c^2}}$ for arbitrary real constants a, b and non-zero c.

There is no requirement that it be a probability distribution.

Not in the sense of a Gaussian probability distribution: the bell-curve of a normal (Gaussian) distribution is a histogram (a map of probability density against values of a single variable), but the curves you quote are (as you note) a map of the values of one variable (new cases) against a second variable (time). (@Accumulation and @TobyBartels point out that Gaussian curves are mathematical constructs that may be unrelated to probability distributions; given that you are asking this question on the statistics SE, I assumed that addressing the Gaussian distribution was an important part of answering the question.)

The possible values under a normal distribution extend from $-\infty$ to $\infty$, but an epidemic curve cannot have negative values on the y axis, and traveling far enough left or right on the x axis, you will run out of cases altogether, either because the disease does not exist, or because Homo sapiens does not exist.

Normal distributions are continuous, but the phenomena epidemic curves measure are actually discrete not continuous: they represent new cases during each discrete unit of time. While we can subdivide time into smaller meaningful units (to a degree), we eventually run into the fact that individuals with new infections are count data (discrete).

Normal distributions are symmetric about their mean, but despite the cartoon conveying a useful public health message about the need to flatten the curve, actual epidemic curves are frequently skewed to the right, with long thin tails as shown below.

Normal distributions are unimodal, but actual epidemic curves may feature one or more bumps (i.e. may be multi-modal, they may even, as in @SextusEmpiricus' answer, be endemic where they return cyclically).

Finally, here is an epidemic curve for COVID-19 in China. You can see that the curve generally diverges from the Gaussian curve (of course there are issues with the reliability of the data, given that many cases were not counted):

You can use scipy.optimize.curve_fit. This method returns not only the estimated optimal values of the parameters, but also the corresponding covariance matrix:

popt: Optimal values for the parameters so that the sum of the squared residuals of f(xdata, *popt) - ydata is minimized.

pcov: The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).

How the sigma parameter affects the estimated covariance depends on absolute_sigma argument, as described above.

If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf, on the other hand ‘trf’ and ‘dogbox’ methods use Moore-Penrose pseudoinverse to compute the covariance matrix.

You can calculate the standard deviation errors of the parameters from the square roots of the diagonal elements of the covariance matrix as follows:
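A minimal sketch of that calculation is shown below. It assumes the Holling type III curve from the question is meant as y = a·x²/(b² + x²), and it uses synthetic "cumulative deaths" with a small deterministic perturbation rather than real COVID-19 data; the function name and starting values are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

def holling_type_3(x, a, b):
    """Holling type III curve, y = a x^2 / (b^2 + x^2) (assumed form)."""
    return a * x**2 / (b**2 + x**2)

# Synthetic cumulative curve (a = 3000, b = 20) with a 1% perturbation
days = np.arange(1.0, 61.0)
deaths = holling_type_3(days, 3000.0, 20.0) * (1.0 + 0.01 * np.sin(days))

# Starting values: the maximum of the curve for a, a rough guess for b
p0 = [deaths.max(), 10.0]
popt, pcov = curve_fit(holling_type_3, days, deaths, p0=p0)

# One standard deviation errors from the diagonal of the covariance matrix
perr = np.sqrt(np.diag(pcov))
print(popt)
print(perr)
```

Because b enters the model only as b², the fit cannot distinguish b from −b; with a positive starting value it normally returns the positive root.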
