- Methods that fit a model of molecular evolution to the sequence data are more computationally intensive, but typically perform better than distance-based methods
- How many unrooted bifurcating trees are possible with 12 sequences?
- 105
- 10,395
- 2,027,025
- 654,729,075
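The count behind this question can be checked directly. The sketch below assumes fully bifurcating trees and uses the standard double-factorial formulas, $(2n-5)!!$ for unrooted and $(2n-3)!!$ for rooted topologies; the function names are illustrative only.

```python
def double_factorial(m: int) -> int:
    """Product m * (m - 2) * (m - 4) * ... down to 1 (taken as 1 for m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def n_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating topologies: (2n - 5)!!"""
    if n_taxa < 3:
        return 1
    return double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa: int) -> int:
    """Number of rooted, fully bifurcating topologies: (2n - 3)!!"""
    if n_taxa < 2:
        return 1
    return double_factorial(2 * n_taxa - 3)


# Print how quickly the number of topologies grows with the number of sequences.
for n in (4, 6, 8, 10, 12):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Each additional sequence multiplies the unrooted count by $2n-5$, which is why exhaustive search becomes impractical after only a handful of sequences.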
- The large number of trees makes it difficult to find the tree with the highest likelihood
- Phylogeny programs have to use heuristic approaches to find the 'best' tree
- Starting with an initial tree, make modifications and test whether they give a better tree or not
- Nearest neighbour interchange
- Subtree pruning and regrafting
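The modify-and-test idea can be sketched as a simple hill climb. This is a toy illustration, not how any particular phylogeny program is implemented: it uses four taxa, where a nearest neighbour interchange across the single internal edge yields the other two topologies, and the log-likelihood values are made up for the example.

```python
def hill_climb(start, neighbours, score):
    """Greedy search: move to the first neighbour that scores higher, until none does."""
    current, current_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for candidate in neighbours(current):
            candidate_score = score(candidate)
            if candidate_score > current_score:
                current, current_score = candidate, candidate_score
                improved = True
                break
    return current, current_score


def nni_neighbours(tree):
    """A quartet ((a, b), (c, d)) has one internal edge separating {a, b} from {c, d}.
    Swapping one subtree across that edge gives the two other quartet topologies."""
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]


# Hypothetical log-likelihoods for the three quartet topologies (illustrative values only).
TOY_LOG_LIKELIHOOD = {
    frozenset({frozenset("AB"), frozenset("CD")}): -1250.0,
    frozenset({frozenset("AC"), frozenset("BD")}): -1240.0,
    frozenset({frozenset("AD"), frozenset("BC")}): -1260.0,
}


def toy_score(tree):
    """Look up the made-up log-likelihood, ignoring the order of taxa within the tuples."""
    key = frozenset(frozenset(pair) for pair in tree)
    return TOY_LOG_LIKELIHOOD[key]


best, best_score = hill_climb((("A", "B"), ("C", "D")), nni_neighbours, toy_score)
print(best, best_score)
```

A search like this only guarantees a local optimum, so in practice it is common to repeat it from several starting trees.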
- Just like with distance-based methods, we assume a model of sequence evolution
- However, we can now test which is the 'best' model
- We often don't need a good tree for this; a rough initial tree is usually enough to compare models
- In addition to assuming a model of how, for example, one nucleotide changes to another, we can also assume a model of how substitution rates overall vary across the sequence
- Constrained regions, e.g. those that are functionally important
- Variable regions, e.g. those under immune selection
- Two sorts of models of rate heterogeneity
- Gamma distribution (possibly with an additional invariant category)
- Categorical
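To make the effect of rate heterogeneity concrete, the sketch below averages a per-site quantity over rate categories. It assumes the Jukes-Cantor (JC69) model for a pair of sequences, four equally weighted categories with illustrative rates chosen to have mean 1 (not derived from a fitted gamma distribution), and an optional invariant fraction that is applied without rescaling the other rates.

```python
import math


def jc69_p_diff(distance: float, rate: float = 1.0) -> float:
    """Probability that a site differs between two sequences under JC69,
    where `distance` is in expected substitutions per site and `rate` scales it."""
    return 0.75 * (1.0 - math.exp(-4.0 * rate * distance / 3.0))


def mixture_p_diff(distance, rates, weights, p_invariant=0.0):
    """Average the per-category probability over rate categories.
    Invariant sites never differ, so they only shrink the variable fraction."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("category weights must sum to 1")
    variable = sum(w * jc69_p_diff(distance, r) for r, w in zip(rates, weights))
    return (1.0 - p_invariant) * variable


# Illustrative rate categories (assumed values): mean rate is 1.0.
rates = [0.1, 0.5, 1.0, 2.4]
weights = [0.25, 0.25, 0.25, 0.25]

homogeneous = jc69_p_diff(0.5)
heterogeneous = mixture_p_diff(0.5, rates, weights)
with_invariant = mixture_p_diff(0.5, rates, weights, p_invariant=0.2)
print(homogeneous, heterogeneous, with_invariant)
```

With the same mean rate, the heterogeneous case shows fewer observed differences than the homogeneous one, because fast sites saturate while slow sites barely change. Ignoring rate variation therefore tends to underestimate the true distance.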
- To choose a model, we have to balance model fit (likelihood) with complexity (number of parameters)
- For non-nested models, two criteria are commonly used (lower is better)
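- Akaike's Information Criterion (AIC): $AIC = 2k - 2 \ln(L)$
- Bayesian Information Criterion (BIC): $BIC = k \ln(n) - 2 \ln(L)$
- Here $k$ is the number of free parameters, $L$ the maximised likelihood and $n$ the sample size (for an alignment, typically the number of sites)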
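- BIC tends to favour simpler models than AIC, because its penalty per parameter, $\ln(n)$, exceeds AIC's penalty of 2 once $n > e^2 \approx 7.4$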
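A minimal sketch of model selection with both criteria follows. The log-likelihoods and the alignment length are made-up values for illustration, and the parameter counts cover only the substitution model (branch lengths are left out here, on the assumption that they are shared by all candidates and so do not change the ranking).

```python
import math


def aic(k: int, log_likelihood: float) -> float:
    """Akaike's Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(k: int, n: int, log_likelihood: float) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fits: model name -> (free substitution-model parameters, ln L).
candidates = {
    "JC69": (0, -5230.0),
    "HKY85": (4, -5105.0),
    "GTR+G": (9, -5095.0),
}
n_sites = 1000  # assumed alignment length

for name, (k, lnl) in candidates.items():
    print(f"{name:6s} AIC={aic(k, lnl):.2f} BIC={bic(k, n_sites, lnl):.2f}")

# Lower is better for both criteria.
best_by_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_by_bic = min(
    candidates, key=lambda m: bic(candidates[m][0], n_sites, candidates[m][1])
)
print(best_by_aic, best_by_bic)
```

With these illustrative numbers the two criteria disagree: AIC picks the richer model, while the heavier BIC penalty selects the simpler one.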