26 May 2017

Last Lecture

  • Parsimony states that the best tree is the one that requires the fewest substitutions (mutations)
  • Given tree with sequences at tips, can calculate parsimony score easily
  • Can reconstruct ancestral states
  • Branch lengths are given in substitutions
  • Only some sites are parsimony informative
  • Finding parsimony tree is hard
  • Exhaustive search not feasible
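
As a reminder of how the score of a fixed tree can be computed, here is a minimal sketch of Fitch's algorithm, which is one standard way to do it (the notes do not say which method was used in the lecture). The nested-tuple tree representation and the function names are assumptions made for illustration. Parsimony scores do not depend on where the tree is rooted, so any rooting of the unrooted tree will do.

```python
def fitch_score(tree, tip_states):
    """Parsimony score of one site on a rooted binary tree.

    tree: a tip name (str) or a pair (left, right) of subtrees.
    tip_states: dict mapping tip name -> observed character at this site.
    """
    def visit(node):
        # Returns (set of candidate ancestral states, substitutions so far).
        if isinstance(node, str):
            return {tip_states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            # Children can agree on a state: no extra substitution needed.
            return common, left_cost + right_cost
        # Children disagree: one substitution is required on this part of the tree.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch scores over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )
```

For example, `parsimony_score((("A", "B"), ("C", "D")), alignment)` scores the tree that groups A with B and C with D, where `alignment` maps each tip name to its sequence.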

This lecture

  • Branch and bound
  • Heuristic search
  • Problems with parsimony:
  1. underestimates branch lengths (due to repeat mutations)
  2. long branch attraction (due to repeat mutations and convergent evolution)
  3. statistically inconsistent (because of long branch attraction)
  • Likelihood based approaches to building trees

Branch and bound

Systematically search all possible unrooted trees

"Branch" refers to method of search, not the edge of a tree.

Builds tree up one taxon (leaf) at a time

Only continues with a partial tree if it could still lead to the best tree, i.e. the score of the partial tree is no greater than the score of the best complete tree found so far.

Branch and bound algorithm

Initialise: Build an initial tree, \(t^*\), using some method (e.g., UPGMA or NJ) and let \(s^* = score(t^*)\).

Choose 3 (distant) taxa and form the unique unrooted 3-taxon tree.

Add this tree to a queue.

Iterate: Choose a taxon and add to previous best partial tree (at front of queue) in each possible position to get \(k\) new partial trees, \(t_1,\ldots,t_k\)

  1. If \(score(t_i) \leq s^*\), add \(t_i\) to queue and order the queue by score and number of taxa
  2. If \(score(t_i) > s^*\), discard \(t_i\).
  3. If \(t_i\) is complete (all taxa have been added) and \(score(t_i) < s^*\), set \(s^* = score(t_i)\).

Finish: When queue is empty, return tree with lowest score.
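
The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not in the notes: trees are nested tuples, the unrooted tree is stored as `(first_taxon, rest)` so that every node of `rest` stands for one edge of the unrooted tree, the initial bound comes from an arbitrary "caterpillar" tree rather than UPGMA or NJ, taxa are added in alphabetical order rather than starting from three distant ones, and the queue is ordered by number of taxa first and score second to get depth-first behaviour. Pruning is safe because adding a taxon can never lower the parsimony score.

```python
import heapq
from itertools import count


def site_cost(tree, column):
    """Fitch pass for one site: (candidate state set, substitutions)."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    (a, cost_a), (b, cost_b) = site_cost(tree[0], column), site_cost(tree[1], column)
    if a & b:
        return a & b, cost_a + cost_b
    return a | b, cost_a + cost_b + 1


def score(tree, alignment):
    """Parsimony score over all sites, using only the taxa present in the tree."""
    n_sites = len(next(iter(alignment.values())))
    return sum(site_cost(tree, {t: s[i] for t, s in alignment.items()})[1]
               for i in range(n_sites))


def insertions(tree, taxon):
    """Yield every tree made by attaching `taxon` at one node of `tree`."""
    yield (tree, taxon)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def branch_and_bound(alignment):
    """Return (best tree, best score) for an alignment with at least 3 taxa."""
    taxa = sorted(alignment)
    first, start = taxa[0], (taxa[1], taxa[2])

    # Initialise: any complete tree gives a valid bound (assumption: caterpillar tree).
    rest = start
    for taxon in taxa[3:]:
        rest = (rest, taxon)
    best_tree = (first, rest)
    best_score = score(best_tree, alignment)

    tick = count()  # tie-breaker so the heap never compares trees directly
    queue = []
    if len(taxa) > 3:
        queue.append((-3, score((first, start), alignment), next(tick), start))

    while queue:
        neg_placed, partial_score, _, rest = heapq.heappop(queue)
        if partial_score > best_score:
            continue  # the bound has improved since this tree was queued
        n_placed = -neg_placed
        for new_rest in insertions(rest, taxa[n_placed]):
            candidate = (first, new_rest)
            s = score(candidate, alignment)
            if s > best_score:
                continue  # discard: cannot lead to a better complete tree
            if n_placed + 1 == len(taxa):
                if s < best_score:
                    best_tree, best_score = candidate, s
            else:
                heapq.heappush(queue, (-(n_placed + 1), s, next(tick), new_rest))
    return best_tree, best_score
```

Calling `branch_and_bound(alignment)` with a dict of equal-length sequences returns one most parsimonious tree and its score. Ties are not enumerated in this sketch.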

Branch and bound

Depth-first search — follow most promising point as far as possible

The better (lower) the initial bound, the faster the algorithm, because more partial trees can be discarded early.

Suggests a method of use as in this example:

  • another method finds best score of 506,
  • but Branch and Bound taking too long with bound of 506.
  • So set bound at 502.
  • Either find tree with score \(\leq 502\) or establish no such tree.
  • In latter case, could try bound of 503 etc.

Feasible for 20-30 taxa, depending on data.
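
To see why pruning matters at this scale, it helps to count the trees. An unrooted binary tree on \(k-1\) taxa has \(2(k-1)-3 = 2k-5\) edges, and the \(k\)-th taxon can be attached to any of them, so the number of unrooted trees on \(n\) taxa is \(3 \times 5 \times \cdots \times (2n-5)\). A short sketch (the function name is an assumption for illustration):

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted binary trees on n labelled taxa, for n >= 3."""
    total = 1
    for k in range(4, n_taxa + 1):
        # The k-th taxon can attach to any of the 2k - 5 edges of a (k-1)-taxon tree.
        total *= 2 * k - 5
    return total


for n in (5, 10, 20, 30):
    print(n, count_unrooted_trees(n))
```

Ten taxa already give 2,027,025 trees and twenty give about \(2.2 \times 10^{20}\), so branch and bound is only practical when the bound removes almost all of the search space.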

Heuristic search

Subtree prune and regraft to modify trees

What is wrong with parsimony

  • No generative model of evolutionary process
  • Doesn't capture hidden or multiple mutations
  • Long-branch attraction can cause wrong tree topology
  • Long-branch attraction causes statistical inconsistency: with more data, we become more sure of the wrong tree

Likelihood and model based approaches

How would a model look?

Given data \(D\) consisting of \(n\) sequences of length \(L\).

\(D\) is \(n \times L\) matrix.

Sequences evolved on some tree \(g\) according to some model of substitution process with parameters \(\mu\)

Want to find \(g\) and \(\mu\).

To find \(g\) and \(\mu\), use Bayes' theorem:

\[ P(g,\mu |D) = \frac{P(D|g,\mu)P(g,\mu)}{P(D)} \]

  • \(P(g,\mu |D)\) is posterior: belief about parameters after seeing data
  • \(P(D|g,\mu)\) is likelihood: chance of seeing data given parameters
  • \(P(g, \mu)\) is prior: prior belief about tree and parameters
  • We haven't got time to explore the full Bayesian solution (see BioInf 702)
  • We'll just look at the likelihood and how to maximise it
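
Written out, maximising the likelihood means choosing the tree and parameters under which the observed data are most probable:

\[ (\hat{g}, \hat{\mu}) = \arg\max_{g,\mu} P(D|g,\mu) \]

If we further assume that the \(L\) sites evolve independently (a common simplifying assumption, not stated above), the likelihood factorises over the columns \(D_1,\ldots,D_L\) of the data matrix:

\[ P(D|g,\mu) = \prod_{j=1}^{L} P(D_j|g,\mu) \]

In practice one usually maximises the log of this product, which turns it into a sum over sites.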

Likelihood vs parsimony

Need to account for all possible ancestral states

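Likelihood of a tree is built from the probability of the change in state along each branch, multiplied over all branches and summed over all possible ancestral states.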
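
One way to account for all ancestral states without listing every combination is Felsenstein's pruning algorithm, sketched below for a single site. The notes do not name this algorithm or fix a substitution model, so both the algorithm choice and the Jukes-Cantor (JC69) model used here are assumptions for illustration, as are the tree representation (a tip is a name, an internal node is a pair of `(child, branch_length)` entries) and the function names.

```python
import math

STATES = "ACGT"


def jc69_prob(same, t):
    """JC69 probability of ending in the same / a specific different state
    after a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partials(node, column):
    """Return {state: P(tip data below node | node is in that state)}."""
    if isinstance(node, str):
        # A tip: probability 1 for the observed state, 0 otherwise.
        return {s: 1.0 if s == column[node] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, length in node:
        below = partials(child, column)
        for s in STATES:
            # Sum over every possible state x at the child end of the branch.
            result[s] *= sum(jc69_prob(s == x, length) * below[x] for x in STATES)
    return result


def site_likelihood(tree, column):
    """Likelihood of one site: average the root partials over the
    uniform JC69 stationary distribution."""
    return sum(0.25 * p for p in partials(tree, column).values())


# Hypothetical example: tips A and B are sisters, C is joined at the root.
tree = ((("A", 0.1), ("B", 0.2)), 0.05), ("C", 0.3)
print(site_likelihood(tree, {"A": "G", "B": "G", "C": "T"}))
```

The likelihood of a whole alignment, assuming independent sites, is the product of `site_likelihood` over columns. Summing logs is numerically safer than multiplying.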