CS 369: Parsimony

25 May 2017

Last lecture

Revisited UPGMA
UPGMA works when distances are ultrametric
many distances not ultrametric because we don't have a constant molecular clock
Introduced neighbour joining
NJ requires only additive distances
UPGMA \(O(n^2)\), NJ \(O(n^3)\) (worst case)
Rooted and unrooted trees, outgroups

This lecture

criticisms of distance based methods
parsimony idea
finding parsimony score: algorithm
reconstructing ancestral states
finding max parsimony tree

What is wrong with distance based methods?

Distance based methods use only a coarse summary of the data.

Reduce the full \(n \times L\) data matrix (\(n\) taxa, sequences of length \(L\)) to an \(n \times n\) distance matrix.
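As a rough illustration of how much is discarded, the sketch below collapses the four-sequence alignment used later in this lecture into a matrix of pairwise mismatch counts. The helper name `hamming_matrix` and the use of raw mismatch counts as the distance are assumptions for illustration; the lecture does not fix a particular distance.

```python
def hamming_matrix(seqs):
    """Return the n x n matrix of pairwise mismatch counts between aligned sequences."""
    n = len(seqs)
    return [[sum(a != b for a, b in zip(seqs[i], seqs[j])) for j in range(n)]
            for i in range(n)]

alignment = ["AAG", "AAA", "GGA", "AGA"]   # n = 4 taxa, L = 3 sites
for row in hamming_matrix(alignment):
    print(row)
```

The matrix records how many sites differ between each pair but not which sites, so the site patterns themselves are no longer available to the tree-building method.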

May produce close to correct tree if data is well-behaved.

But:

Data is not usually well-behaved (rate variation, non-treelike signals, complex mutation)
no reliable measure of uncertainty (could use bootstrap but doesn't work well)
model is not explicit

Parsimony

Uses more data than distance based methods

A form of Occam's razor: the simplest model explaining the data should be preferred

Best tree is the one that requires the fewest changes along it to explain all sequences

Tree is called the maximum parsimony tree or sometimes just the parsimony tree

No known polynomial-time algorithm is guaranteed to find the maximum parsimony tree.

Have algorithm that calculates the parsimony score for a given tree.

Parsimony idea

Given alignment with 4 sequences:

AAG
AAA
GGA
AGA

Two (out of three) possible relationships between them are:

Out of these two options, we'd choose the one on the left.

Parsimony algorithm: set up

Given tree with sequences at leaves. Consider one site (column) at a time.

Number the nodes, in descending order, so that the root node is \(2n-1\).

Let \(u\) be the site for which we are considering the cost.

Let \(B_u\) be the parsimony cost for that site.

Parsimony algorithm

Initialise Set \(B_u = 0\) and \(k = 2n-1\).

Recursion To obtain the set \(R_k\), the set of possible ancestral values at node \(k\):

If \(k\) is a leaf node: Set \(R_k = \{x^k_u\}\), where \(x^k_u\) is the character at site \(u\) of the sequence at leaf \(k\).

If \(k\) is not a leaf:

compute \(R_i\) and \(R_j\) for child nodes of \(k\).
Set \(R_k = R_i \cap R_j\) if \(R_i \cap R_j \neq \emptyset\). Otherwise, set \(R_k = R_i \cup R_j\) and set \(B_u = B_u+1\).

Stop Return \(B_u\), the cost of the tree for that site.
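The recursion can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the tree is written as nested pairs of leaf names rather than numbered nodes, and the cost is returned from each call instead of being accumulated in a shared counter \(B_u\). The name `fitch_site` is hypothetical.

```python
def fitch_site(tree, states):
    """Return (R_k, cost) for the subtree `tree` at a single site.

    tree   : a leaf name (str) or a pair (left, right) of subtrees
    states : dict mapping each leaf name to its character at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0               # leaf: R_k = {x_u^k}, no cost
    left, right = tree
    r_i, cost_i = fitch_site(left, states)     # children are processed first
    r_j, cost_j = fitch_site(right, states)
    common = r_i & r_j
    if common:                                 # non-empty intersection: no extra change
        return common, cost_i + cost_j
    return r_i | r_j, cost_i + cost_j + 1      # empty intersection: one more change

# Site 2 of the alignment AAG, AAA, GGA, AGA on the tree ((s1,s2),(s3,s4))
tree = (("s1", "s2"), ("s3", "s4"))
site2 = {"s1": "A", "s2": "A", "s3": "G", "s4": "G"}
root_set, cost = fitch_site(tree, site2)
print(sorted(root_set), cost)
```

Here the two cherries give \(\{A\}\) and \(\{G\}\), which do not intersect, so the root set is their union and the site costs one change.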

Parsimony algorithm: comments

The total cost of the tree is \[B = \sum_{u = 1}^L B_u\]

Parsimony score invariant to position of root.

Note that the recursion results in a post-order traversal of the tree: although it starts at the root, it descends straight to the leaves and then feeds information back up towards the root.
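A small check of the total score and of root invariance, assuming the same nested-pair tree representation as a sketch (the names `fitch_site` and `parsimony_score` are hypothetical). The same unrooted tree is rooted in two different places and scored on the four-sequence alignment from the parsimony idea slide.

```python
def fitch_site(tree, states):
    """Return (R_k, cost) for one site; tree is a leaf name or a pair of subtrees."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    r_i, cost_i = fitch_site(tree[0], states)
    r_j, cost_j = fitch_site(tree[1], states)
    common = r_i & r_j
    if common:
        return common, cost_i + cost_j
    return r_i | r_j, cost_i + cost_j + 1

def parsimony_score(tree, seqs):
    """Total cost B: the sum of the per-site costs B_u over all L sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {name: s[u] for name, s in seqs.items()})[1]
               for u in range(n_sites))

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
rooted_on_internal_edge = (("s1", "s2"), ("s3", "s4"))
rooted_on_s1_edge = ("s1", ("s2", ("s3", "s4")))   # same unrooted tree, different root
print(parsimony_score(rooted_on_internal_edge, seqs))
print(parsimony_score(rooted_on_s1_edge, seqs))
```

Both rootings give the same total, while a tree with a different unrooted topology, such as `(("s1", "s3"), ("s2", "s4"))`, scores higher on this alignment.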

Example

Example cont

Ancestral reconstruction

Can traceback to get possible ancestral assignments:

Let \(A_k\) be the ancestral assignment at node \(k\).

Start at root so set \(k = 2n-1\).

Choose \(A_k\) uniformly at random from \(R_{k}\).

For each child node \(i\) of \(k\), choose \(A_i = A_k\) if \(A_k \in R_i\), else choose \(A_i\) uniformly at random from \(R_i\).
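The traceback can be sketched as a second, top-down pass over the sets \(R_k\) computed on the way up. This is an illustrative sketch: the nested-pair tree representation, the dictionary of sets keyed by subtree, and the names `fitch_sets`, `traceback` and `count_changes` are all assumptions, not part of the lecture.

```python
import random

def fitch_sets(tree, states, sets):
    """Fill `sets` with R_k for every node (leaf name or pair) of the tree."""
    if isinstance(tree, str):
        sets[tree] = {states[tree]}
    else:
        left, right = tree
        fitch_sets(left, states, sets)
        fitch_sets(right, states, sets)
        common = sets[left] & sets[right]
        sets[tree] = common if common else sets[left] | sets[right]
    return sets

def traceback(tree, sets, rng, parent_state=None, assignment=None):
    """Choose A_k for every node, working from the root down to the leaves."""
    if assignment is None:
        assignment = {}
    r_k = sets[tree]
    if parent_state in r_k:                       # keep the parent's state if allowed
        assignment[tree] = parent_state
    else:                                         # root, or parent's state not in R_k
        assignment[tree] = rng.choice(sorted(r_k))
    if not isinstance(tree, str):
        for child in tree:
            traceback(child, sets, rng, assignment[tree], assignment)
    return assignment

def count_changes(tree, assignment):
    """Number of branches whose two ends carry different states."""
    if isinstance(tree, str):
        return 0
    return sum((assignment[child] != assignment[tree]) + count_changes(child, assignment)
               for child in tree)

tree = (("s1", "s2"), ("s3", "s4"))
sets = fitch_sets(tree, {"s1": "A", "s2": "A", "s3": "G", "s4": "G"}, {})
assignment = traceback(tree, sets, random.Random(1))
print(assignment[tree], count_changes(tree, assignment))
```

Whichever state is drawn at the root, the reconstruction uses exactly as many changes as the parsimony cost of the site.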

Parsimony example: an ancestral reconstruction

Length of a branch and a tree

The number of mutations along a branch is the length of a branch

Can estimate by counting mutations given by each possible ancestral reconstruction and then taking average

If substitution rate is known, can convert number of substitutions to length of time.
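One way to sketch the averaging: enumerate every reconstruction the traceback rule can produce at each site, record on which branches the state changes, and average over reconstructions before summing over sites. The equal weighting of reconstructions, the nested-pair tree representation, and all function names here are assumptions for illustration.

```python
def fitch_sets(tree, states, sets):
    """Fill `sets` with R_k for every node (leaf name or pair) of the tree."""
    if isinstance(tree, str):
        sets[tree] = {states[tree]}
    else:
        left, right = tree
        fitch_sets(left, states, sets)
        fitch_sets(right, states, sets)
        common = sets[left] & sets[right]
        sets[tree] = common if common else sets[left] | sets[right]
    return sets

def reconstructions(tree, sets, parent_state=None):
    """Yield every node -> state assignment the traceback rule can produce."""
    r_k = sets[tree]
    choices = [parent_state] if parent_state in r_k else sorted(r_k)
    for a in choices:
        if isinstance(tree, str):
            yield {tree: a}
            continue
        left, right = tree
        for below_left in reconstructions(left, sets, a):
            for below_right in reconstructions(right, sets, a):
                yield {tree: a, **below_left, **below_right}

def branch_lengths(tree, seqs):
    """Average number of changes on each branch, keyed by the branch's child node."""
    lengths = {}
    n_sites = len(next(iter(seqs.values())))
    for u in range(n_sites):
        sets = fitch_sets(tree, {name: s[u] for name, s in seqs.items()}, {})
        recs = list(reconstructions(tree, sets))
        for parent in [node for node in sets if not isinstance(node, str)]:
            for child in parent:
                changes = sum(rec[child] != rec[parent] for rec in recs)
                lengths[child] = lengths.get(child, 0.0) + changes / len(recs)
    return lengths

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
lengths = branch_lengths((("s1", "s2"), ("s3", "s4")), seqs)
for branch, length in lengths.items():
    print(branch, length)
```

The two branches leaving the root together form a single branch of the unrooted tree, so their lengths would be added when reporting unrooted branch lengths. The branch lengths sum to the parsimony score of the tree.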

Weighted parsimony

Have different costs for different mutations.

Let \(S(a,b)\) be the cost of mutating from \(a\) to \(b\).

The parsimony score of site \(u\) is given by the algorithm:

Initialise Set \(k = 2n-1\).

Recursion

If \(k\) is a leaf node: Set \(S_k(a) = 0\) when \(a = x^k_u\) and \(S_k(a) = \infty\) otherwise

Weighted parsimony: algorithm cont.

If \(k\) is not a leaf:

compute \(S_i(a)\) and \(S_j(a)\) for all \(a\) and children \(i\) and \(j\) of \(k\).
Set \[S_k(a) = \min_b \left(S_i(b) +S(a,b)\right) + \min_b \left(S_j(b) + S(a,b)\right).\]

Stop Return \[B_u = \min_a S_{2n-1}(a).\]

Reduces to the standard parsimony algorithm when \(S(a,a) = 0\) and \(S(a,b) = 1\) if \(a \neq b\).
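A minimal sketch of the weighted recursion for one site, assuming a DNA alphabet and a nested-pair tree representation. The two cost functions are illustrative assumptions: `unit_cost` charges 0 for no change and 1 for any substitution, and `ts_tv_cost` is a hypothetical scheme charging 1 for a transition and 2 for a transversion.

```python
import math

ALPHABET = "ACGT"
PURINES = {"A", "G"}

def sankoff_site(tree, states, cost):
    """Return {a: S_k(a)} for the subtree `tree` at a single site."""
    if isinstance(tree, str):                  # leaf: 0 for the observed state
        return {a: (0 if a == states[tree] else math.inf) for a in ALPHABET}
    left, right = tree
    s_i = sankoff_site(left, states, cost)
    s_j = sankoff_site(right, states, cost)
    return {a: min(s_i[b] + cost(a, b) for b in ALPHABET)
               + min(s_j[b] + cost(a, b) for b in ALPHABET)
            for a in ALPHABET}

def unit_cost(a, b):
    """0 for no change, 1 for any substitution."""
    return 0 if a == b else 1

def ts_tv_cost(a, b):
    """Hypothetical costs: 0 for no change, 1 for a transition, 2 for a transversion."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2

tree = (("s1", "s2"), ("s3", "s4"))
site = {"s1": "A", "s2": "A", "s3": "C", "s4": "C"}
print(min(sankoff_site(tree, site, unit_cost).values()))    # B_u with unit costs
print(min(sankoff_site(tree, site, ts_tv_cost).values()))   # B_u with weighted costs
```

With unit costs the site needs one change; with the weighted scheme the single A to C change is a transversion, so the score doubles.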

Parsimony informative sites

Some sites (columns) have the same score on every tree.

For example, a site with all bases the same will always score zero, regardless of the tree.

Or a site with all bases the same except one will always score one.

To get different scores on different trees, a site needs at least two characters, each of which appears in at least 2 taxa.

Example: which sites are parsimony informative?

AAG  
AAA  
GGA  
AGA  
CCC

only column 2
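The rule can be checked mechanically. A short sketch, with the hypothetical helper name `informative_sites` and sites numbered from 1 to match the column numbering used in the example:

```python
from collections import Counter

def informative_sites(seqs):
    """Return the 1-based columns with at least two characters that each occur in >= 2 taxa."""
    sites = []
    for u, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(u)
    return sites

print(informative_sites(["AAG", "AAA", "GGA", "AGA", "CCC"]))
```

Column 1 has A three times but G and C only once each, and column 3 has only A repeated, so neither can distinguish between trees.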

Finding the most parsimonious tree

Sometimes called finding the shortest tree — the number of substitutions is the length of the tree

We are given \(n\) aligned sequences. What is the tree with the lowest parsimony score that relates these sequences?

Strategies

Exhaustive search
Branch and bound
Heuristic search

Exhaustive search

Enumerate all unrooted trees

For each tree, find the parsimony score of the tree

Pick tree(s) with lowest score
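A sketch of exhaustive search for small \(n\). The enumeration strategy is an assumption chosen for brevity: each unrooted tree is drawn once as a tree rooted on the edge leading to the first taxon, and trees are generated by attaching one leaf at a time to every edge. All function names are hypothetical.

```python
def fitch_site(tree, states):
    """Return (R_k, cost) for one site; tree is a leaf name or a pair of subtrees."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    r_i, cost_i = fitch_site(tree[0], states)
    r_j, cost_j = fitch_site(tree[1], states)
    common = r_i & r_j
    if common:
        return common, cost_i + cost_j
    return r_i | r_j, cost_i + cost_j + 1

def parsimony_score(tree, seqs):
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {name: s[u] for name, s in seqs.items()})[1]
               for u in range(n_sites))

def insert_leaf(tree, leaf):
    """Yield every tree made by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)                          # the edge above this node
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)

def all_rooted_trees(taxa):
    if len(taxa) == 1:
        yield taxa[0]
        return
    for tree in all_rooted_trees(taxa[:-1]):
        yield from insert_leaf(tree, taxa[-1])

def all_unrooted_trees(names):
    """Each unrooted tree once, drawn as rooted on the edge to the first taxon."""
    for subtree in all_rooted_trees(names[1:]):
        yield (names[0], subtree)

def exhaustive_search(seqs):
    names = sorted(seqs)
    scored = [(parsimony_score(t, seqs), t) for t in all_unrooted_trees(names)]
    best = min(score for score, _ in scored)
    return best, [t for score, t in scored if score == best]

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
best_score, best_trees = exhaustive_search(seqs)
print(best_score, best_trees)
```

Scoring a rooted drawing of each unrooted tree is enough because the parsimony score does not depend on where the root is placed.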

Recall how many trees there are:

##  leaves      nUnrootedTrees          nRootedTrees
##      10             2027025              34459425
##      14        316234143225         7905853580625
##      16     213458046676875      6190283353629375
##      17    6190283353629375    191898783962510625
##      18  191898783962510625   6332659870762850625
##      19 6332659870762850625 221643095476699771875

Quickly becomes infeasible; OK for \(n \approx 10\)
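The counts follow from the standard formulas: there are \((2n-5)!!\) unrooted and \((2n-3)!!\) rooted binary trees on \(n\) labelled leaves. A short sketch with hypothetical function names; Python integers have arbitrary precision, so the results are exact however large they grow.

```python
def odd_double_factorial(m):
    """Product of the odd numbers 1 * 3 * ... * m (returns 1 when m < 3)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result

def n_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves: (2n - 5)!!"""
    return odd_double_factorial(2 * n - 5)

def n_rooted_trees(n):
    """Number of rooted binary trees on n >= 2 labelled leaves: (2n - 3)!!"""
    return odd_double_factorial(2 * n - 3)

for n in (4, 10, 14):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

The number of rooted trees on \(n\) leaves equals the number of unrooted trees on \(n+1\) leaves, since a root behaves like one extra leaf.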

Branch and bound

Systematically search all possible unrooted trees

"Branch" refers to method of search, not the edge of a tree.

Builds tree up one taxon (leaf) at a time

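Only continues along a search path if it could still lead to the best tree, that is, if the score of the partial tree is less than the score of the best complete tree found so far. Adding a taxon can never lower the parsimony score, so any other path can be abandoned safely.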
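A compact sketch of the idea, assuming a nested-pair tree representation in which each unrooted tree is drawn as rooted on the edge to the first taxon. The function names are hypothetical, and the initial bound of infinity is a simplification; in practice a score from a heuristic tree would give a tighter starting bound.

```python
import math

def fitch_site(tree, states):
    """Return (R_k, cost) for one site; tree is a leaf name or a pair of subtrees."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    r_i, cost_i = fitch_site(tree[0], states)
    r_j, cost_j = fitch_site(tree[1], states)
    common = r_i & r_j
    if common:
        return common, cost_i + cost_j
    return r_i | r_j, cost_i + cost_j + 1

def parsimony_score(tree, seqs):
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {name: s[u] for name, s in seqs.items()})[1]
               for u in range(n_sites))

def insert_leaf(tree, leaf):
    """Yield every tree made by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)

def branch_and_bound(names, seqs):
    """Return (score, tree) for one most parsimonious tree on the taxa in `names`."""
    best_score, best_tree = math.inf, None

    def extend(subtree, k):
        nonlocal best_score, best_tree
        tree = (names[0], subtree)
        score = parsimony_score(tree, seqs)     # score of the partial tree
        if score >= best_score:                 # bound: adding taxa cannot lower it
            return
        if k == len(names):                     # all taxa placed: new best tree
            best_score, best_tree = score, tree
            return
        for bigger in insert_leaf(subtree, names[k]):   # branch: place the next taxon
            extend(bigger, k + 1)

    extend(names[1], 2)
    return best_score, best_tree

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
print(branch_and_bound(["s1", "s2", "s3", "s4"], seqs))
```

Because the test is a strict inequality, the search returns a single best tree; relaxing it to allow ties would collect all equally parsimonious trees at the cost of less pruning.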