asr_mk_model {castor}R Documentation

Ancestral state reconstruction with Mk models and rerooting

Description

Ancestral state reconstruction of a discrete trait using a fixed-rates continuous-time Markov model (a.k.a. "Mk model"). This function can estimate the (instantaneous) transition matrix using maximum likelihood, or take a specified transition matrix. The function can optionally calculate marginal ancestral state estimates for each node in the tree, using the rerooting method by Yang et al. (1995).

Usage

asr_mk_model( tree, 
              tip_states, 
              Nstates = NULL, 
              tip_priors = NULL, 
              rate_model = "ER", 
              transition_matrix = NULL, 
              include_ancestral_likelihoods = TRUE, 
              root_prior = "empirical", 
              Ntrials = 1, 
              optim_algorithm = "nlminb",
              optim_max_iterations = 200,
              optim_rel_tol = 1e-8,
              store_exponentials = TRUE,
              check_input =TRUE, 
              Nthreads = 1)

Arguments

tree

A rooted tree of class "phylo". The root is assumed to be the unique node with no incoming edge.

tip_states

An integer vector of size Ntips, specifying the state of each tip in the tree in terms of an integer from 1 to Nstates, where Ntips is the number of tips and Nstates is the number of possible states (see below). Can also be NULL. If tip_states==NULL, then tip_priors must not be NULL (see below).

Nstates

Either NULL, or an integer specifying the number of possible states of the trait. If Nstates==NULL, then it will be computed based on the maximum value encountered in tip_states or based on the number of columns in tip_priors (whichever is non-NULL).

tip_priors

A 2D numeric matrix of size Ntips x Nstates, where Nstates is the possible number of states for the character modelled. Can also be NULL. Each row of this matrix must be a probability vector, i.e. it must only contain non-negative entries and must sum up to 1. The [i,s]-th entry should be the prior probability of tip i being in state s. If you know for certain that tip i is in some state s, you can set the corresponding entry to 1 and all other entries in that row to 0. If you know the exact states of all tips, you can also pass these via tip_states instead. If tip_priors==NULL, then tip_states must not be NULL (see above).

rate_model

Rate model to be used for fitting the transition rate matrix. Can be "ER" (all rates equal), "SYM" (transition rate i–>j is equal to transition rate j–>i), "ARD" (all rates can be different), "SUEDE" (only stepwise transitions i–>i+1 and i–>i-1 allowed, all 'up' transitions are equal, all 'down' transitions are equal) or "SRD" (only stepwise transitions i–>i+1 and i–>i-1 allowed, and each rate can be different). Can also be an index matrix that maps entries of the transition matrix to the corresponding independent rate parameter to be fitted. Diagonal entries should map to 0, since diagonal entries are not treated as independent rate parameters but are calculated from the remaining entries in the transition matrix. All other entries that map to 0 represent a transition rate of zero. The format of this index matrix is similar to the format used by the ace function in the ape package. rate_model is only relevant if transition_matrix==NULL.

transition_matrix

Either a numeric quadratic matrix of size Nstates x Nstates containing fixed transition rates, or NULL. The [r,c]-th entry in this matrix should store the transition rate from state r to state c. Each row in this matrix must have sum zero. If NULL, then the transition rates will be estimated using maximum likelihood, based on the rate_model specified.

root_prior

Prior probability distribution of the root's states. Can be "flat" (all states equal), "empirical" (empirical probability distribution of states across the tree's tips) or "stationary" (stationary probability distribution of the transition matrix). If "stationary" and transition_matrix==NULL, then a transition matrix is first fitted using a flat root prior, and then used to calculate the stationary distribution. root_prior can also be a non-negative numeric vector of size Nstates and with total sum equal to 1.

include_ancestral_likelihoods

Include the marginal ancestral state distribution for each node (conditional scaled likelihoods after rerooting) in the return values. Note that this may increase the computation time and memory needed, so you may set this to FALSE if you don't need marginal ancestral states.

Ntrials

Number of trials (starting points) for fitting the transition matrix. Only relevant if transition_matrix=NULL. A higher number may reduce the risk of landing in a local non-global optimum of the likelihood function, but will increase computation time during fitting.

optim_algorithm

Either "optim" or "nlminb", specifying which optimization algorithm to use for maximum-likelihood estimation of the transition matrix. Only relevant if transition_matrix==NULL.

optim_max_iterations

Maximum number of iterations (per fitting trial) allowed for optimizing the likelihood function.

optim_rel_tol

Relative tolerance (stop criterion) for optimizing the likelihood function.

store_exponentials

Logical, specifying whether to pre-calculate and store exponentials of the transition matrix during calculation of ancestral likelihoods. This may reduce computation time because each exponential is only calculated once, but requires more memory since all exponentials are stored.

Only relevant if include_ancestral_likelihoods==TRUE, otherwise exponentials are never stored.

check_input

Logical, specifying whether to perform some basic checks on the validity of the input data. If you are certain that your input data are valid, you can set this to FALSE to reduce computation.

Nthreads

Number of parallel threads to use for running multiple fitting trials simultaneously. This only makes sense if your computer has multiple cores/CPUs and if Ntrials>1, and is only relevant if transition_matrix==NULL. This option is ignored on Windows, because Windows does not support forking.

Details

For this function, the trait's states must be represented by integers within 1,..,Nstates, where Nstates is the total number of possible states. If the states are originally in some other format (e.g. characters or factors), you should map them to a set of integers 1,..,Nstates. The order of states (if relevant) should be reflected in their integer representation. For example, if your original states are "small", "medium" and "large" and rate_model=="SUEDE", it is advised to represent these states as integers 1,2,3. You can easily map any set of discrete states to integers using the function map_to_state_space.

This function allows the specification of the precise tip states (if these are known) using the vector tip_states. Alternatively, if some tip states are only known in terms of a probability distribution, you can pass these probability distributions using the matrix tip_priors. Note that exactly one of the two arguments, tip_states or tip_priors, must be non-NULL.

Tips must be represented in tip_states or tip_priors in the same order as in tree$tip.label. None of the input vectors or matrixes need include row or column names; if they do, however, they are checked for consistency (if check_input==TRUE).

The rerooting method by Yang et al (2015) is used to reconstruct the marginal ancestral state probabilities for each node by treating the node as a root and calculating its conditional scaled likelihoods. Note that the re-rooting algorithm is strictly speaking only valid for reversible Mk models, that is, satisfying the criterion

π_i Q_{ij}=π_j Q_{ji},\quad\forall i,j,

where Q is the transition rate matrix and π is the stationary distribution of the model. The rate models “ER”, 'SYM”, “SUEDE” and “SRD” are reversible. For example, for “SUEDE” or “SRD” choose π_{i+1}=π_{i}Q_{i,i+1}/Q_{i+1,i}. In contrast, “ARD” models are generally not reversible.

If tree$edge.length is missing, each edge in the tree is assumed to have length 1. The tree may include multi-furcations (i.e. nodes with more than 2 children) as well as mono-furcations (i.e. nodes with only one child). This function is similar to rerootingMethod in the phytools package (v0.5-64) and similar to ape::ace (v4.1) with options method="ML", type="discrete" and marginal=FALSE, but tends to be much faster than rerootingMethod and ace for large trees.

Value

A list with the following elements:

success

Logical, indicating whether ASR was successful. If FALSE, all other return values may be NULL.

transition_matrix

A numeric quadratic matrix of size Nstates x Nstates, containing the transition rates of the Markov model. The [r,c]-th entry is the transition rate from state r to state c. Will be the same as the input transition_matrix, if the latter was not NULL.

loglikelihood

Log-likelihood of the Markov model. If transition_matrix was NULL in the input, then this will be the log-likelihood maximized during fitting.

ancestral_likelihoods

Optional, only returned if include_ancestral_likelihoods was TRUE. A 2D numeric matrix, listing the probability of each node being in each state (marginal ancestral state distributions). This matrix will have size Nnodes x Nstates, where Nstates was either explicitly provided as an argument, or inferred from tip_states or tip_priors (whichever was non-NULL). The rows in this matrix will be in the order in which nodes are indexed in the tree, i.e. the [n,s]-th entry will be the probability of the s-th state for the n-th node. For example, likelihoods[1,3] will store the probability that the first node was in state 3.

Author(s)

Stilianos Louca

References

Z. Yang, S. Kumar and M. Nei (1995). A new method for inference of ancestral nucleotide and amino acid sequences. Genetics. 141:1641-1650.

See Also

hsp_mk_model, asr_max_parsimony, asr_squared_change_parsimony, hsp_max_parsimony, map_to_state_space

Examples

# generate random tree
Ntips = 1000
tree  = generate_random_tree(list(birth_rate_intercept=1),max_tips=Ntips)$tree

# create random transition matrix
Nstates = 5
Q = get_random_mk_transition_matrix(Nstates, rate_model="ER", max_rate=0.01)
cat(sprintf("Simulated ER transition rate=%g\n",Q[1,2]))

# simulate the trait's evolution
simulation = simulate_mk_model(tree, Q)
tip_states = simulation$tip_states
cat(sprintf("Simulated states for last 20 nodes:\n"))
print(tail(simulation$node_states,20))

# reconstruct node states from simulated tip states
results = asr_mk_model(tree, tip_states, Nstates, rate_model="ER", Ntrials=2)
node_states = max.col(results$ancestral_likelihoods)

# print Mk model fitting summary
cat(sprintf("Mk model: log-likelihood=%g\n",results$loglikelihood))
cat(sprintf("Fitted ER transition rate=%g\n",results$transition_matrix[1,2]))

# print reconstructed node states for last 20 nodes
print(tail(node_states,20))

[Package castor version 1.3.3 Index]