25 May 2017

Last lecture

  • Revisited UPGMA
  • UPGMA works when distances are ultrametric
  • many distances not ultrametric because we don't have a constant molecular clock
  • Introduced neighbour joining
  • NJ requires only additive distances
  • UPGMA \(O(n^2)\), NJ \(O(n^3)\) (worst case)
  • Rooted and unrooted trees, outgroups

This lecture

  • criticisms of distance based methods
  • parsimony idea
  • finding parsimony score: algorithm
  • reconstructing ancestral states
  • finding max parsimony tree

What is wrong with distance based methods?

Distance based methods use only a coarse summary of the data.

Reduce full \(n \times L\) data matrix (\(n\) taxa, sequences of length \(L\)) to an \(n \times n\) distance matrix.
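To make the reduction concrete, here is a minimal sketch that collapses a small alignment into a matrix of pairwise Hamming distances (the number of sites at which two sequences differ). The function name `hamming_matrix` and the use of raw mismatch counts are assumptions for illustration; a model-corrected distance would normally be used in practice.

```python
def hamming_matrix(seqs):
    """Return the n x n matrix of pairwise mismatch counts for aligned sequences."""
    n = len(seqs)
    return [[sum(a != b for a, b in zip(seqs[i], seqs[j])) for j in range(n)]
            for i in range(n)]

# Example alignment: n = 4 taxa, L = 3 sites.
alignment = ["AAG", "AAA", "GGA", "AGA"]
D = hamming_matrix(alignment)
for row in D:
    print(row)
```

Each entry records only how many sites differ, not which ones, so different alignments can produce the same matrix.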

May produce close to correct tree if data is well-behaved.

But:

  • Data is not usually well-behaved (rate variation, non-treelike signals, complex mutation)

  • no reliable measure of uncertainty (could use bootstrap but doesn't work well)

  • model is not explicit

Parsimony

  • Uses more data than distance based methods
  • A form of Occam's razor: the simplest model explaining the data should be preferred
  • Best tree is the one that requires the fewest changes along it to explain all sequences
  • Tree is called the maximum parsimony tree or sometimes just the parsimony tree
  • No known polynomial-time algorithm is guaranteed to find the maximum parsimony tree.
  • Have algorithm that calculates the parsimony score for a given tree.

Parsimony idea

Given alignment with 4 sequences:

AAG
AAA
GGA
AGA

Two (out of three) possible relationships between them are:

Out of these two options, we'd choose the one on the left.

Parsimony algorithm: set up

Given tree with sequences at leaves. Consider one site (column) at a time.

Number the nodes from the root downwards in descending order, so that the root is node \(2n-1\) (a rooted binary tree with \(n\) leaves has \(2n-1\) nodes).

Let \(u\) be the site for which we are considering the cost.

Let \(B_u\) be the parsimony cost for that site.

Parsimony algorithm

Initialise Set \(B_u = 0\) and \(k = 2n-1\).

Recursion To obtain the set \(R_k\), the set of possible ancestral values at node \(k\):

If \(k\) is a leaf node: Set \(R_k = \{x^k_u\}\), where \(x^k_u\) is the character at site \(u\) of the sequence at leaf \(k\).

If \(k\) is not a leaf:

  1. compute \(R_i\) and \(R_j\) for child nodes of \(k\).
  2. Set \(R_k = R_i \cap R_j\) if \(R_i \cap R_j \neq \emptyset\). Otherwise, set \(R_k = R_i \cup R_j\) and set \(B_u = B_u+1\).

Stop Return \(B_u\), the cost of the tree for that site.
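The recursion above can be sketched in Python as follows. The data layout is an assumption made for this example: `children` maps each internal node to its two children, `leaf_state` maps each leaf to its observed character, and the node numbers are hypothetical.

```python
def fitch_site(children, leaf_state, root):
    """Parsimony cost B_u and the sets R_k for one site.

    children   : dict mapping each internal node k to its children (i, j)
    leaf_state : dict mapping each leaf k to its observed character x^k_u
    root       : number of the root node (2n - 1 in the numbering above)
    """
    R = {}
    cost = 0

    def visit(k):
        nonlocal cost
        if k not in children:          # leaf: R_k holds the observed character
            R[k] = {leaf_state[k]}
            return
        i, j = children[k]
        visit(i)                       # children first (post-order)
        visit(j)
        common = R[i] & R[j]
        if common:
            R[k] = common
        else:                          # empty intersection: one change needed
            R[k] = R[i] | R[j]
            cost += 1

    visit(root)
    return cost, R

# Hypothetical tree with n = 4 leaves (1..4), internal nodes 5 and 6, root 7.
children = {5: (1, 2), 6: (3, 4), 7: (5, 6)}
leaf_state = {1: "A", 2: "A", 3: "G", 4: "G"}
cost, R = fitch_site(children, leaf_state, root=7)
print(cost, R[7])
```

Here the two cherries agree internally, so the only union is taken at the root and the site costs one change.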

Parsimony algorithm: comments

The total cost of the tree is \[B = \sum_{u = 1}^L B_u\]

Parsimony score invariant to position of root.
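As a small check of both comments, the sketch below sums the per-site costs over an alignment and scores the same unrooted tree under two different rootings. Representing a tree as nested pairs of taxon indices is an assumption made to keep the example short.

```python
def fitch_cost(tree, column):
    """Return (R, cost) for one site; a tree is a taxon index or a pair of subtrees."""
    if isinstance(tree, int):
        return {column[tree]}, 0
    (r_i, c_i), (r_j, c_j) = fitch_cost(tree[0], column), fitch_cost(tree[1], column)
    common = r_i & r_j
    if common:
        return common, c_i + c_j
    return r_i | r_j, c_i + c_j + 1

def parsimony_score(tree, seqs):
    """Total cost B: the sum of the per-site costs B_u over all columns."""
    return sum(fitch_cost(tree, column)[1] for column in zip(*seqs))

seqs = ["AAG", "AAA", "GGA", "AGA"]   # taxa 0, 1, 2, 3
balanced = ((0, 1), (2, 3))           # rooted on the internal edge
rerooted = (0, (1, (2, 3)))           # same unrooted tree, rooted on the edge to taxon 0
other = ((0, 2), (1, 3))              # a different unrooted topology

print(parsimony_score(balanced, seqs), parsimony_score(rerooted, seqs))
print(parsimony_score(other, seqs))
```

The first two scores agree because only the root position differs; the third tree groups the taxa differently and needs one more change.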

Note that the recursion results in a post-order traversal of the tree: although the start is from the root, go straight down to leaves and feed information back up towards root.

Example

Example cont

Ancestral reconstruction

Can traceback to get possible ancestral assignments:

Let \(A_k\) be the ancestral assignment at node \(k\).

Start at root so set \(k = 2n-1\).

Choose \(A_k\) uniformly at random from \(R_{k}\).

For each child node \(i\) of \(k\), choose \(A_i = A_k\) if \(A_k \in R_i\), else choose \(A_i\) uniformly at random from \(R_i\). Repeat for the children of \(i\) until the leaves are reached.
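A minimal sketch of this traceback, assuming trees are nested pairs of taxon indices and each node is stored as a small dictionary (both are choices made for this example only). The helper `count_changes` is hypothetical and is included to check that the reconstruction uses exactly as many changes as the parsimony cost.

```python
import random

def fitch_sets(tree, column):
    """Upward pass: return (node, cost), where node stores R and its children."""
    if isinstance(tree, int):
        return {"R": {column[tree]}, "kids": []}, 0
    left, c_left = fitch_sets(tree[0], column)
    right, c_right = fitch_sets(tree[1], column)
    common = left["R"] & right["R"]
    if common:
        return {"R": common, "kids": [left, right]}, c_left + c_right
    union = left["R"] | right["R"]
    return {"R": union, "kids": [left, right]}, c_left + c_right + 1

def traceback(node, parent_state, rng):
    """Downward pass: keep the parent's state if allowed, else pick from R at random."""
    if parent_state in node["R"]:
        node["A"] = parent_state
    else:
        node["A"] = rng.choice(sorted(node["R"]))
    for kid in node["kids"]:
        traceback(kid, node["A"], rng)

def count_changes(node):
    """Number of edges whose two endpoints were assigned different states."""
    return sum((kid["A"] != node["A"]) + count_changes(kid) for kid in node["kids"])

tree = (((0, 1), 2), (3, 4))
column = "ACGGA"                             # one site for taxa 0..4
root, cost = fitch_sets(tree, column)
traceback(root, None, random.Random(1))      # None: the root has no parent state
print(root["A"], count_changes(root), cost)
```

Different random seeds can give different assignments, but each one uses the same number of changes as the cost computed on the upward pass.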

Parsimony example: an ancestral reconstruction

Length of a branch and a tree

The number of mutations along a branch is the length of a branch

Can estimate by counting mutations given by each possible ancestral reconstruction and then taking average

If substitution rate is known, can convert number of substitutions to length of time.

Weighted parsimony

Have different costs for different mutations.

Let \(S(a,b)\) be the cost of mutating from \(a\) to \(b\).

The parsimony score of site \(u\) is given by the algorithm:

Initialise Set \(k = 2n-1\).

Recursion

If \(k\) is a leaf node: Set \(S_k(a) = 0\) when \(a = x^k_u\) and \(S_k(a) = \infty\) otherwise

Weighted parsimony: algorithm cont.

If \(k\) is not a leaf:

  1. compute \(S_i(a)\) and \(S_j(a)\) for all \(a\) and children \(i\) and \(j\) of \(k\).
  2. Set \[S_k(a) = \min_b \left(S_i(b) +S(a,b)\right) + \min_b \left(S_j(b) + S(a,b)\right).\]

Stop Return \[B_u = \min_a S_{2n-1}(a).\]

Reduces to the standard parsimony algorithm when \(S(a,a) = 0\) and \(S(a,b) = 1\) if \(a \neq b\).
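The weighted recursion can be sketched as below. Trees are again nested pairs of taxon indices, which is an assumption of this example, and the transition/transversion weights in `ts_tv_cost` are hypothetical values chosen only to show a non-uniform cost function.

```python
import math

BASES = "ACGT"

def unit_cost(a, b):
    """Cost 0 for no change and 1 for any change."""
    return 0 if a == b else 1

def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 1 if (a in purines) == (b in purines) else 2

def sankoff(tree, column, S):
    """Return a dict a -> S_k(a) for the node at the top of `tree`."""
    if isinstance(tree, int):                     # leaf: only the observed base is free
        return {a: (0 if a == column[tree] else math.inf) for a in BASES}
    S_i = sankoff(tree[0], column, S)
    S_j = sankoff(tree[1], column, S)
    return {a: min(S_i[b] + S(a, b) for b in BASES)
               + min(S_j[b] + S(a, b) for b in BASES)
            for a in BASES}

def weighted_score(tree, column, S):
    """B_u: the minimum of the root costs over all root states."""
    return min(sankoff(tree, column, S).values())

tree = ((0, 1), (2, 3))
print(weighted_score(tree, "AAGG", unit_cost))    # one change of any kind
print(weighted_score(tree, "AACC", ts_tv_cost))   # one transversion
```

With `unit_cost` the result matches the unweighted count of changes; with `ts_tv_cost` a site that needs a transversion is penalised more heavily than one that needs a transition.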

Parsimony informative sites

Some sites (columns) have the same score on every tree.

For example, a site with all bases the same will always score zero, regardless of the tree.

Or a site with all bases the same except one will always score one.

To get different scores on different trees, a site needs at least two different characters, each of which appears in at least two taxa. Such sites are called parsimony informative.

Example: which sites are parsimony informative?

AAG  
AAA  
GGA  
AGA  
CCC  
  • only column 2
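The rule can be checked mechanically. The following is a short sketch (the function names are chosen for this example) that counts how many characters occur at least twice in each column and reports the informative columns, numbered from 1.

```python
from collections import Counter

def is_informative(column):
    """True if at least two different characters each appear in at least two taxa."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2

def informative_sites(seqs):
    """Return the 1-based indices of the parsimony informative columns."""
    return [u for u, column in enumerate(zip(*seqs), start=1)
            if is_informative(column)]

seqs = ["AAG", "AAA", "GGA", "AGA", "CCC"]
print(informative_sites(seqs))  # [2]
```

Column 1 has three A's but only one G and one C, and column 3 has three A's but only one G and one C, so neither can distinguish between trees.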

Finding the most parsimonious tree

Sometimes called finding the shortest tree — the number of substitutions is the length of the tree

We are given \(n\) aligned sequences. What is the tree with the lowest parsimony score that relates these sequences?

Strategies

  • Exhaustive search
  • Branch and bound
  • Heuristic search

Exhaustive search
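Exhaustive search scores every possible tree and keeps the best, so its cost is driven by the number of trees. As a hedged illustration, the sketch below uses the standard count of unrooted binary trees on \(n\) labelled leaves, \((2n-5)!! = 3 \times 5 \times \cdots \times (2n-5)\) for \(n \geq 3\); the function name is chosen for this example.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n labelled leaves: (2n - 5)!! for n >= 3."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # k = 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Four taxa give the three possible trees mentioned earlier, while ten taxa already give more than two million, which is why exhaustive search is practical only for small \(n\).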

Branch and bound

Systematically search all possible unrooted trees

"Branch" refers to method of search, not the edge of a tree.

Builds tree up one taxon (leaf) at a time

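Only continues with a partial tree if it could still lead to the best tree: that is, if the score of the partial tree is less than the score of the best complete tree found so far. Adding a taxon can never lower the parsimony score, so any other partial tree can be discarded.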
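A minimal sketch of this idea, under several assumptions made for the example: trees are nested pairs of taxon indices, taxon 0 is attached at the root so that each rooted structure stands for one unrooted tree, taxa are added in their input order, and unweighted parsimony is used for the score. It returns one best tree rather than all ties.

```python
def fitch_cost(tree, column):
    """Return (R, cost) for one site; a tree is a taxon index or a pair of subtrees."""
    if isinstance(tree, int):
        return {column[tree]}, 0
    (r_i, c_i), (r_j, c_j) = fitch_cost(tree[0], column), fitch_cost(tree[1], column)
    common = r_i & r_j
    if common:
        return common, c_i + c_j
    return r_i | r_j, c_i + c_j + 1

def parsimony_score(tree, seqs):
    """Sum of per-site costs; only the taxa present in `tree` are used."""
    return sum(fitch_cost(tree, column)[1] for column in zip(*seqs))

def insertions(tree, taxon):
    """Yield every tree obtained by attaching `taxon` to one edge of `tree`."""
    yield (tree, taxon)                        # the edge above this subtree
    if not isinstance(tree, int):
        left, right = tree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)

def branch_and_bound(seqs):
    """Return (best score, one best tree) for n >= 3 aligned sequences."""
    n = len(seqs)
    best = [float("inf"), None]

    def extend(subtree, next_taxon):
        score = parsimony_score((0, subtree), seqs)
        if score >= best[0]:
            return                             # bound: cannot beat the best complete tree
        if next_taxon == n:                    # all taxa placed: new best tree
            best[0], best[1] = score, (0, subtree)
            return
        for bigger in insertions(subtree, next_taxon):
            extend(bigger, next_taxon + 1)     # branch: try every edge

    extend((1, 2), 3)                          # the single unrooted tree on taxa 0, 1, 2
    return best[0], best[1]

print(branch_and_bound(["AAG", "AAA", "GGA", "AGA"]))
```

On the four-sequence alignment from earlier, the search returns a score of 3 for the tree that pairs the first two sequences against the last two.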