| fit_hbds_model_on_grid {castor} | R Documentation |
Given a timetree (potentially sampled through time and not necessarily ultrametric), fit a homogenous birth-death-sampling (HBDS) model in which speciation, extinction and lineage sampling occurs at some continuous (Poissonian) rates λ, μ and ψ, which are defined on a fixed grid of discrete time points and assumed to vary polynomially between grid points. Sampled lineages are kept in the pool of extant lineages at some “retention probability” κ, which may also depend on time. In addition, this model can include concentrated sampling attempts (CSAs) at a finite set of discrete time points t_1,..,t_m. “Homogenous” refers to the assumption that, at any given moment in time, all lineages exhibit the same speciation/extinction/sampling rates. Every HBDS model is thus defined based on the values that λ, μ, ψ and κ take over time, as well as the sampling probabilities ρ_1,..,ρ_m and retention probabilities κ_1,..,κ_m during the concentrated sampling attempts. This function estimates the values of λ, μ, ψ and κ on each grid point, as well as the ρ_1,..,ρ_m and κ_1,..,κ_m, by maximizing the corresponding likelihood of the timetree. Special cases of this model (when rates are piecewise constant through time) are sometimes known as “birth-death-skyline plots” in the literature (Stadler 2013). In epidemiology, these models are often used to describe the phylogenies of viral strains sampled over the course of the epidemic.
fit_hbds_model_on_grid( tree,
root_age = NULL,
oldest_age = NULL,
age_grid = NULL,
CSA_ages = NULL,
min_lambda = 0,
max_lambda = +Inf,
min_mu = 0,
max_mu = +Inf,
min_psi = 0,
max_psi = +Inf,
min_kappa = 0,
max_kappa = 1,
min_CSA_probs = 0,
max_CSA_probs = 1,
min_CSA_kappas = 0,
max_CSA_kappas = 1,
guess_lambda = NULL,
guess_mu = NULL,
guess_psi = NULL,
guess_kappa = NULL,
guess_CSA_probs = NULL,
guess_CSA_kappas = NULL,
fixed_lambda = NULL,
fixed_mu = NULL,
fixed_psi = NULL,
fixed_kappa = NULL,
fixed_CSA_probs = NULL,
fixed_CSA_kappas = NULL,
fixed_age_grid = NULL,
const_lambda = FALSE,
const_mu = FALSE,
const_psi = FALSE,
const_kappa = FALSE,
const_CSA_probs = FALSE,
const_CSA_kappas = FALSE,
splines_degree = 1,
condition = "auto",
ODE_relative_dt = 0.001,
ODE_relative_dy = 1e-3,
CSA_age_epsilon = NULL,
Ntrials = 1,
max_start_attempts = 1,
Nthreads = 1,
max_model_runtime = NULL,
Nbootstraps = 0,
Ntrials_per_bootstrap = NULL,
fit_control = list(),
focal_param_values = NULL,
verbose = FALSE,
diagnostics = FALSE,
verbose_prefix = "")
tree |
A timetree of class "phylo", representing the time-calibrated reconstructed phylogeny of a set of extant and/or extinct species. Tips of the tree are interpreted as terminally sampled lineages, while monofurcating nodes are interpreted as non-terminally sampled lineages, i.e., lineages sampled at some past time point and with subsequently sampled descendants. |
root_age |
Positive numeric, specifying the age of the tree's root. Can be used to define a time offset, e.g. if the last tip was not actually sampled at the present. If |
oldest_age |
Strictly positive numeric, specifying the oldest time before present (“age”) to consider when calculating the likelihood. If this is equal to or greater than the root age, then |
age_grid |
Numeric vector, listing ages in ascending order, on which λ, μ, ψ and κ are fitted and allowed to vary independently. This grid must cover at least the age range from the present (age 0) to |
CSA_ages |
Optional numeric vector, listing ages (in ascending order) at which concentrated sampling attempts (CSAs) occurred. If |
min_lambda |
Numeric vector of length Ngrid (= |
max_lambda |
Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted speciation rate λ at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use |
min_mu |
Numeric vector of length Ngrid, or a single numeric, specifying lower bounds for the fitted extinction rate μ at each point in the age grid. If a single numeric, the same lower bound applies at all ages. |
max_mu |
Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted extinction rate μ at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use |
min_psi |
Numeric vector of length Ngrid, or a single numeric, specifying lower bounds for the fitted Poissonian sampling rate ψ at each point in the age grid. If a single numeric, the same lower bound applies at all ages. |
max_psi |
Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted Poissonian sampling rate ψ at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use |
min_kappa |
Numeric vector of length Ngrid, or a single numeric, specifying lower bounds for the fitted retention probability κ at each point in the age grid. If a single numeric, the same lower bound applies at all ages. |
max_kappa |
Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted retention probability κ at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use |
min_CSA_probs |
Numeric vector of length NCSA (= |
max_CSA_probs |
Numeric vector of length NCSA, or a single numeric, specifying upper bounds for the fitted sampling probabilities ρ_1, ρ_2, ... at each concentrated sampling attempt. If a single numeric, the same upper bound applies at all CSAs. Note that, since ρ_1, ρ_2, ... are probabilities, |
min_CSA_kappas |
Numeric vector of length NCSA, or a single numeric, specifying lower bounds for the fitted retention probabilities κ_1, κ_2, ... at each concentrated sampling attempt. If a single numeric, the same lower bound applies at all CSAs. Note that, since κ_1, κ_2, ... are probabilities, |
max_CSA_kappas |
Numeric vector of length NCSA, or a single numeric, specifying upper bounds for the fitted sampling probabilities κ_1, κ_2, ... at each concentrated sampling attempt. If a single numeric, the same upper bound applies at all CSAs. Note that, since κ_1, κ_2, .. are probabilities, |
guess_lambda |
Initial guess for λ at each age-grid point. Either |
guess_mu |
Initial guess for μ at each age-grid point. Either |
guess_psi |
Initial guess for ψ at each age-grid point. Either |
guess_kappa |
Initial guess for κ at each age-grid point. Either |
guess_CSA_probs |
Initial guess for the ρ_1, ρ_2, ... at each concentrated sampling attempt. Either |
guess_CSA_kappas |
Initial guess for the κ_1, κ_2, ... at each concentrated sampling attempt. Either |
fixed_lambda |
Optional fixed (i.e. non-fitted) λ values on one or more age-grid points. Either |
fixed_mu |
Optional fixed (i.e. non-fitted) μ values on one or more age-grid points. Either |
fixed_psi |
Optional fixed (i.e. non-fitted) ψ values on one or more age-grid points. Either |
fixed_kappa |
Optional fixed (i.e. non-fitted) κ values on one or more age-grid points. Either |
fixed_CSA_probs |
Optional fixed (i.e. non-fitted) ρ_1, ρ_2, ... values on one or more age-grid points. Either |
fixed_CSA_kappas |
Optional fixed (i.e. non-fitted) κ_1, κ_2, ... values on one or more age-grid points. Either |
fixed_age_grid |
Optional numeric vector, specifying an age grid on which This option may be useful if you want to fit some parameters on a coarse grid, but want to specify (fix) some other parameters on a much finer grid. Also note that if |
const_lambda |
Logical, specifying whether λ should be assumed constant across the grid, i.e. time-independent. Setting |
const_mu |
Logical, specifying whether μ should be assumed constant across the grid, i.e. time-independent. Setting |
const_psi |
Logical, specifying whether ψ should be assumed constant across the grid, i.e. time-independent. Setting |
const_kappa |
Logical, specifying whether κ should be assumed constant across the grid, i.e. time-independent. Setting |
const_CSA_probs |
Logical, specifying whether the ρ_1, ρ_2, ... should be the same across all CSAs. Setting |
const_CSA_kappas |
Logical, specifying whether the κ_1, κ_2, ... should be the same across all CSAs. Setting |
splines_degree |
Integer between 0 and 3 (inclusive), specifying the polynomial degree of λ, μ, ψ and κ between age-grid points. If 0, then λ, μ, ψ and κ are considered piecewise constant, if 1 they are considered piecewise linear, if 2 or 3 they are considered to be splines of degree 2 or 3, respectively. The |
condition |
Character, either "crown", "stem", "none" or "auto", specifying on what to condition the likelihood. If "crown", the likelihood is conditioned on the survival of the two daughter lineages branching off at the root. If "stem", the likelihood is conditioned on the survival of the stem lineage. Note that "crown" really only makes sense when |
ODE_relative_dt |
Positive unitless number, specifying the default relative time step for the ordinary differential equation solvers. Typical values are 0.01-0.001. |
ODE_relative_dy |
Positive unitless number, specifying the relative difference between subsequent simulated and interpolated values, in internally used ODE solvers. Typical values are 1e-2 to 1e-5. A smaller |
CSA_age_epsilon |
Non-negative numeric, in units of time, specfying the age radius around a concentrated sampling attempt, within which to assume that sampling events were due to that concentrated sampling attempt. If |
Ntrials |
Integer, specifying the number of independent fitting trials to perform, each starting from a random choice of model parameters. Increasing |
max_start_attempts |
Integer, specifying the number of times to attempt finding a valid start point (per trial) before giving up on that trial. Randomly choosen extreme start parameters may occasionally result in Inf/undefined likelihoods, so this option allows the algorithm to keep looking for valid starting points. |
Nthreads |
Integer, specifying the number of parallel threads to use for performing multiple fitting trials simultaneously. This should generally not exceed the number of available CPUs on your machine. Parallel computing is not available on the Windows platform. |
max_model_runtime |
Optional numeric, specifying the maximum number of seconds to allow for each evaluation of the likelihood function. Use this to abort fitting trials leading to parameter regions where the likelihood takes a long time to evaluate (these are often unlikely parameter regions). |
Nbootstraps |
Integer, specifying the number of parametric bootstraps to perform for estimating standard errors and confidence intervals of estimated parameters. Set to 0 for no bootstrapping. |
Ntrials_per_bootstrap |
Integer, specifying the number of fitting trials to perform for each bootstrap sampling. If |
fit_control |
Named list containing options for the |
focal_param_values |
Optional list, listing combinations of parameter values of particular interest and for which the log-likelihoods should be returned. Every element of this list should itself be a named list, containing the elements |
verbose |
Logical, specifying whether to print progress reports and warnings to the screen. Note that errors always cause a return of the function (see return values |
diagnostics |
Logical, specifying whether to print detailed information (such as model likelihoods) at every iteration of the fitting routine. For debugging purposes mainly. |
verbose_prefix |
Character, specifying the line prefix for printing progress reports to the screen. |
Warning: In the absence of concentrated sampling attempts (NCSA=0), and without well-justified a priori constraints on either λ, μ, ψ and/or κ, it is generally impossible to reliably estimate λ, μ, ψ and κ from timetrees alone. This routine (and any other software that claims to estimate λ, μ, ψ and κ solely from timetrees) should thus be treated with great suspicion. Many epidemiological models make the (often reasonable assumption) that κ=0; note that even in this case, one generally can't co-estimate λ, μ and ψ from the timetree alone.
It is advised to provide as much information to the function fit_hbds_model_on_grid as possible, including reasonable lower and upper bounds (min_lambda, max_lambda, min_mu, max_mu, min_psi, max_psi, min_kappa, max_kappa) and reasonable parameter guesses. It is also important that the age_grid is sufficiently fine to capture the expected major variations of λ, μ, ψ and κ over time, but keep in mind the serious risk of overfitting when age_grid is too fine and/or the tree is too small. The age_grid does not need to be uniform, i.e., you may want to use a finer grid in regions where there's more data (tips) available. If strong lower and upper bounds are not available and fitting takes a long time to run, consider using the option max_model_runtime to limit how much time the fitting allows for each evaluation of the likelihood.
Note that here "age" refers to time before present, i.e., age increases from tips to root and age 0 is present-day. CSAs are enumerated in the order of increasing age, i.e., from the present to the past. Similarly, the age grid specifies time points from the present towards the past.
A list with the following elements:
success |
Logical, indicating whether model fitting succeeded. If |
objective_value |
The maximized fitting objective. Currently, only maximum-likelihood estimation is implemented, and hence this will always be the maximized log-likelihood. |
objective_name |
The name of the objective that was maximized during fitting. Currently, only maximum-likelihood estimation is implemented, and hence this will always be “loglikelihood”. |
loglikelihood |
The log-likelihood of the fitted model for the given timetree. |
guess_loglikelihood |
The log-likelihood of the guessed model for the given timetree. |
param_fitted |
Named list, specifying the fixed and fitted model parameters. This list will contain the elements |
param_guess |
Named list, specifying the guessed model parameters. This list will contain the elements |
age_grid |
Numeric vector of size NG, the age-grid on which λ, μ, ψ and κ are defined. This will be the same as the provided |
CSA_ages |
Numeric vector of size NCSA, ting listhe ages at which concentrated sampling attempts occurred. This is the same as provided to the function. |
NFP |
Integer, number of free (i.e., independently) fitted parameters. If none of the λ, μ and ρ were fixed, and |
Ndata |
Integer, the number of data points (sampling and branching events) used for fitting. |
AIC |
The Akaike Information Criterion for the fitted model, defined as 2k-2\log(L), where k is the number of fitted parameters and L is the maximized likelihood. |
BIC |
The Bayesian information criterion for the fitted model, defined as \log(n)k-2\log(L), where k is the number of fitted parameters, n is the number of data points (number of branching times), and L is the maximized likelihood. |
condition |
Character, specifying what conditioning was root for the likelihood (e.g. "crown" or "stem"). |
converged |
Logical, specifying whether the maximum likelihood was reached after convergence of the optimization algorithm. Note that in some cases the maximum likelihood may have been achieved by an optimization path that did not yet converge (in which case it's advisable to increase |
Niterations |
Integer, specifying the number of iterations performed during the optimization path that yielded the maximum likelihood. |
Nevaluations |
Integer, specifying the number of likelihood evaluations performed during the optimization path that yielded the maximum likelihood. |
standard_errors |
Named list specifying the standard errors of the parameters, based on parametric bootstrapping. This list will contain the elements |
CI50lower |
Named list specifying the lower end of the 50% confidence interval (i.e. the 25% quantile) for each parameter, based on parametric bootstrapping. This list will contain the elements |
CI50upper |
Similar to |
CI95lower |
Similar to |
CI95upper |
Similar to |
consistency |
Numeric between 0 and 1, estimated consistency of the data with the fitted model. If L denotes the loglikelihood of new data generated by the fitted model (under the same model) and M denotes the expectation of L, then |
Stilianos Louca
T. Stadler, D. Kuehnert, S. Bonhoeffer, A. J. Drummond (2013). Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV). PNAS. 110:228-233.
A. Lindholm, D. Zachariah, P. Stoica, T. B. Schoen (2019). Data consistency approach to model validation. IEEE Access. 7:59788-59796.
simulate_deterministic_hbds,
fit_hbds_model_parametric
## Not run:
# define lambda & mu & psi as functions of time
# Assuming an exponentially varying lambda & mu, and a constant psi
time2lambda = function(times){ 2*exp(0.1*times) }
time2mu = function(times){ 0.1*exp(0.09*times) }
time2psi = function(times){ rep(0.2, times=length(times)) }
# define concentrated sampling attempts
CSA_times = c(3,4)
CSA_probs = c(0.1, 0.2)
# generate random tree based on lambda, mu & psi
# assume that all sampled lineages are removed from the pool (i.e. kappa=0)
time_grid = seq(from=0, to=100, by=0.01)
simul = generate_tree_hbds( max_time = 5,
time_grid = time_grid,
lambda = time2lambda(time_grid),
mu = time2mu(time_grid),
psi = time2psi(time_grid),
kappa = 0,
CSA_times = CSA_times,
CSA_probs = CSA_probs,
CSA_kappas = 0)
tree = simul$tree
root_age = simul$root_age
cat(sprintf("Tree has %d tips\n",length(tree$tip.label)))
# Define an age grid on which lambda_function & mu_function shall be fitted
fit_age_grid = seq(from=0,to=root_age,length.out=3)
# Fit an HBDS model on a grid
# Assume that psi is known and that sampled lineages are removed from the pool
# Hence, we only fit lambda & mu & CSA_probs
cat(sprintf("Fitting model to tree..\n"))
fit = fit_hbds_model_on_grid(tree,
root_age = root_age,
age_grid = fit_age_grid,
CSA_ages = rev(simul$final_time - CSA_times),
fixed_psi = time2psi(simul$final_time-fit_age_grid),
fixed_kappa = 0,
fixed_CSA_kappas = 0,
Ntrials = 4,
Nthreads = 4,
Nbootstraps = 0,
verbose = TRUE,
verbose_prefix = " ")
if(!fit$success){
cat(sprintf("ERROR: Fitting failed: %s\n",fit$error))
}else{
# compare fitted lambda to true lambda
plot(x=fit$age_grid,
y=fit$param_fitted$lambda,
type='l',
col='#000000',
xlim=c(root_age,0),
xlab='age',
ylab='lambda')
lines(x=simul$final_time-time_grid,
y=time2lambda(time_grid),
type='l',
col='#0000AA')
}
# compare true and fitted model in terms of their LTTs
LTT = castor::count_lineages_through_time(tree, Ntimes=100, include_slopes=TRUE)
LTT$ages = root_age - LTT$times
cat(sprintf("Simulating deterministic HBDS (true model)..\n"))
age0 = 0.5 # reference age at which to equate LTTs
LTT0 = approx(x=LTT$ages, y=LTT$lineages, xout=age0)$y # tree LTT at age0
fsim = simulate_deterministic_hbds( age_grid = fit$age_grid,
lambda = fit$param_fitted$lambda,
mu = fit$param_fitted$mu,
psi = fit$param_fitted$psi,
kappa = fit$param_fitted$kappa,
CSA_ages = fit$CSA_ages,
CSA_probs = fit$param_fitted$CSA_probs,
CSA_kappas = fit$param_fitted$CSA_kappas,
requested_ages = seq(0,root_age,length.out=200),
age0 = age0,
LTT0 = LTT0,
splines_degree = 1)
if(!fsim$success){
cat(sprintf("ERROR: Could not simulate fitted model: %s\n",fsim$error))
stop()
}
plot(x=LTT$ages, y=LTT$lineages, type='l', col='#0000AA', lwd=2, xlim=c(root_age,0))
lines(x=fsim$ages, y=fsim$LTT, type='l', col='#000000', lwd=2)
## End(Not run)