CS 369: Building trees: UPGMA, Neighbour Joining and Parsimony

23 May 2017

Distance based methods

Usually given sequences

Define a distance between sequences

Calculate distances between each pair of sequences

Build tree from resulting distance matrix

Already seen UPGMA. Will also look at neighbour-joining

Defining distances: Hamming distance

Hamming distance for an aligned pair \(x\), \(y\), \[ D_{xy} = \mbox{ number of places } x \mbox{ and } y \mbox{ differ.} \]

Normalise this by dividing by sequence length \(L\) to get \[ f_{xy} = \frac{ D_{xy}} L = \mbox{ fraction of sites at which } x \mbox{ and } y \mbox{ differ.}\]

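\(f\) captures evolutionary distance well for small distances, but underestimates larger ones: a site that has mutated more than once is still counted at most once, so \(f\) grows more and more slowly as sequences diverge.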
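A minimal sketch of the two quantities just defined, assuming the sequences are already aligned and stored as plain strings (the function names and the toy sequences are illustrative only):

```python
def hamming(x: str, y: str) -> int:
    """Number of aligned positions at which x and y differ (D_xy)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y))


def p_distance(x: str, y: str) -> float:
    """Fraction of sites at which x and y differ (f_xy = D_xy / L)."""
    return hamming(x, y) / len(x)


x = "ACGTACGTAC"
y = "ACGTTCGTAA"
print(hamming(x, y))     # 2
print(p_distance(x, y))  # 0.2
```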

Jukes-Cantor distance

Based on model where mutations between all four bases are equally likely.

Corrects for fact that unrelated sequences will agree simply due to chance

\[d_{xy} = -\frac34 \log \left (1- \frac{4f_{xy}}3\right).\]

Can use other mutation models to define other distances.
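The Jukes-Cantor formula can be sketched as below, taking \(\log\) to be the natural logarithm. The function name is illustrative. The formula is only defined for \(f_{xy} < 3/4\), so the sketch raises an error outside that range.

```python
import math


def jukes_cantor(f: float) -> float:
    """Jukes-Cantor distance from the fraction f of differing sites."""
    if not 0 <= f < 0.75:
        raise ValueError("Jukes-Cantor distance is only defined for 0 <= f < 3/4")
    return -0.75 * math.log(1 - 4 * f / 3)


# Compare the raw fraction f with the corrected distance d.
for f in (0.01, 0.1, 0.3, 0.6, 0.7):
    print(f"f = {f:.2f}  ->  d = {jukes_cantor(f):.4f}")
```

For small \(f\) the two values nearly coincide; as \(f\) approaches \(3/4\) the corrected distance grows without bound.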

Building trees with UPGMA

UPGMA: Unweighted Pair Group Method using arithmetic Averages

Basic idea:

1. Start with each object in its own cluster and assign a (leaf) node to each cluster.
2. Merge the two closest clusters (cluster distance being the average over all pairs of members) and assign an (internal) node at height half the distance between them.
3. If more than one cluster remains, go to 2; otherwise stop.
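The steps above can be sketched as follows. This is a simple, unoptimised version under stated assumptions: distances are supplied as a dict keyed by `frozenset({a, b})`, internal nodes are represented as nested tuples, and the last two clusters are merged to place the root. None of these choices come from the lecture.

```python
from itertools import combinations


def upgma(names, d):
    """Cluster `names` by UPGMA.

    d maps frozenset({a, b}) -> distance for every pair of leaf names.
    Returns (tree, heights): tree is a nested tuple of leaf names and
    heights maps each internal node (a nested tuple) to its height.
    """
    size = {name: 1 for name in names}  # active clusters and their leaf counts
    dist = dict(d)                      # working copy, extended as clusters merge
    heights = {}
    while len(size) > 1:
        # Find the two closest active clusters and merge them.
        a, b = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        new = (a, b)
        heights[new] = dist[frozenset((a, b))] / 2
        na, nb = size.pop(a), size.pop(b)
        # Distance to every other cluster is the average over all leaf pairs,
        # i.e. the size-weighted mean of the two old distances.
        for c in size:
            dist[frozenset((new, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        size[new] = na + nb
    (tree,) = size
    return tree, heights


# Toy ultrametric distances (illustrative values, not from the lecture).
pairs = {("A", "B"): 2, ("C", "D"): 4,
         ("A", "C"): 6, ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6}
d = {frozenset(p): v for p, v in pairs.items()}
tree, heights = upgma(["A", "B", "C", "D"], d)
print(tree)           # (('A', 'B'), ('C', 'D'))
print(heights[tree])  # 3.0
```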

Does UPGMA produce the "correct" tree?

In theory: Yes, so long as the distances are ultrametric.

Tree on left is ultrametric, tree on right is not

Ultrametric distances

Formally, distances are ultrametric when, for all points \(i,j,k\), the distances \(d_{ij}, d_{jk},d_{ik}\) are either all equal or two are equal and the remaining one is smaller

This means that the distances are tree-like and all leaves of tree have same distance to root.
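The three-point condition can be tested directly. A minimal sketch, assuming the distances are given as a full symmetric matrix (list of lists); the tolerance `tol` is an added parameter to absorb floating-point rounding:

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """True if, for every triple, the two largest distances are equal."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        small, mid, large = sorted((d[i][j], d[j][k], d[i][k]))
        # "All equal, or two equal and the third smaller" is the same as
        # saying the two largest of the three distances coincide.
        if large - mid > tol:
            return False
    return True


# Illustrative matrices: the first is clock-like, the second is not.
clock_like = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
not_clock_like = [
    [0.0, 0.6, 0.8, 1.2],
    [0.6, 0.0, 0.4, 0.8],
    [0.8, 0.4, 0.0, 0.6],
    [1.2, 0.8, 0.6, 0.0],
]
print(is_ultrametric(clock_like))      # True
print(is_ultrametric(not_clock_like))  # False
```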

When are distances between sequences ultrametric?

Get ultrametric distances when sequences evolve according to a strict molecular clock:

Molecular clock means that mutations occur at a constant rate across whole tree.

In practice: distances are rarely ultrametric.

When UPGMA doesn't work:

\[ d = \begin{array}{ccccc} & A & B & C & D \\ A & 0 & 0.6 & 0.8 & 1.2 \\ B & & 0 & 0.4 & 0.8 \\ C & & & 0 & 0.6 \\ D & & & & 0 \end{array}\]

In UPGMA, choose the pair with smallest distance first and join them: here, choose B and C, immediately leading to the wrong topology (shape).
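To see why this goes wrong, the sketch below finds the pair UPGMA joins first and then checks the matrix against a candidate tree in which \(A,B\) and \(C,D\) are the neighbouring pairs. The branch lengths used are an assumption, not stated in these notes: they are one set of values that reproduces \(d\) exactly.

```python
from itertools import combinations
from math import isclose

labels = "ABCD"
d = [
    [0.0, 0.6, 0.8, 1.2],
    [0.6, 0.0, 0.4, 0.8],
    [0.8, 0.4, 0.0, 0.6],
    [1.2, 0.8, 0.6, 0.0],
]

# UPGMA's first move: the pair at the smallest distance.
i, j = min(combinations(range(4), 2), key=lambda p: d[p[0]][p[1]])
closest = labels[i] + labels[j]
print("UPGMA joins first:", closest)

# Assumed tree ((A,B),(C,D)): pendant branch lengths plus one internal branch.
pendant = {"A": 0.5, "B": 0.1, "C": 0.1, "D": 0.5}
internal = 0.2
side = {"A": 0, "B": 0, "C": 1, "D": 1}  # which end of the internal branch


def path_length(x, y):
    """Distance between leaves x and y measured along the assumed tree."""
    crosses = side[x] != side[y]
    return pendant[x] + pendant[y] + (internal if crosses else 0.0)


reproduces_d = all(
    isclose(path_length(labels[a], labels[b]), d[a][b])
    for a, b in combinations(range(4), 2)
)
print("Tree ((A,B),(C,D)) reproduces d:", reproduces_d)
```

\(B\) and \(C\) are close only because both sit on short branches; the long branches to \(A\) and \(D\) violate the molecular clock.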

Additive distances

A set of additive distances can be thought of as tree-like: there is a tree that correctly displays those distances as path lengths along its branches.

Ultrametric distances are additive but the reverse does not hold

Formally, additive distances satisfy the four point condition: any four leaves can be relabelled \(x,y,u,v\) so that \(d(x,y) + d(u,v) \leq d(x,u)+ d(y,v) = d(x,v)+ d(y,u)\).

Any tree produces additive distances when distance is defined as length along branches
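The four point condition says that, of the three ways of pairing up four leaves, the two largest sums of distances are equal. A minimal sketch of that test, assuming a full symmetric matrix as input (the function name and tolerance are illustrative):

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """True if every quartet of leaves satisfies the four point condition."""
    n = len(d)
    for x, y, u, v in combinations(range(n), 4):
        sums = sorted((
            d[x][y] + d[u][v],
            d[x][u] + d[y][v],
            d[x][v] + d[y][u],
        ))
        # The two largest pairing sums must coincide.
        if sums[2] - sums[1] > tol:
            return False
    return True


d = [
    [0.0, 0.6, 0.8, 1.2],
    [0.6, 0.0, 0.4, 0.8],
    [0.8, 0.4, 0.0, 0.6],
    [1.2, 0.8, 0.6, 0.0],
]
print(is_additive(d))  # True: the sums are 1.2, 1.6 and 1.6

# Perturbing one entry (an illustrative change) breaks additivity.
perturbed = [row[:] for row in d]
perturbed[0][3] = perturbed[3][0] = 1.5
print(is_additive(perturbed))  # False: the sums are 1.2, 1.6 and 1.9
```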

Neighbour Joining

Given a set of additive distances, reconstructs the correct tree

Similar to UPGMA but instead of using simple distance matrix, forms the "rate-corrected" distance matrix before joining nearest neighbours.

Neighbour joining idea

Find the nearest neighbour instead of just the node at the smallest distance

Need to subtract the average distance to all other leaves.

Let \[ D_{ij} = d_{ij} - (r_i + r_j) \] where \[ r_i = \frac1{|L| - 2} \sum_{k\in L\setminus \{i\}} d_{ik} \] and \(L\) is the set of leaves (sequences), so \(|L|\) is the number of leaves.

It can be shown (proof omitted) that the pair of leaves \(i,j\) for which \(D_{ij}\) is minimal are neighbouring leaves.
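A small sketch of the rate-corrected matrix, assuming a full symmetric matrix with zeros on the diagonal and more than two leaves. The function name is illustrative; the matrix is the one used in the UPGMA counterexample above.

```python
def rate_corrected(d):
    """Return (r, D) for a full symmetric distance matrix d (list of lists)."""
    n = len(d)
    # r_i: sum of distances from i to all other leaves, divided by |L| - 2.
    r = [sum(row) / (n - 2) for row in d]
    # D_ij = d_ij - (r_i + r_j); the diagonal is not used, so it is left as None.
    D = [
        [None if i == j else d[i][j] - (r[i] + r[j]) for j in range(n)]
        for i in range(n)
    ]
    return r, D


d = [
    [0.0, 0.6, 0.8, 1.2],
    [0.6, 0.0, 0.4, 0.8],
    [0.8, 0.4, 0.0, 0.6],
    [1.2, 0.8, 0.6, 0.0],
]
r, D = rate_corrected(d)
print([round(x, 3) for x in r])  # [1.3, 0.9, 0.9, 1.3]
for row in D:
    print([None if x is None else round(x, 3) for x in row])
```

Although \(B\) and \(C\) have the smallest raw distance, the smallest corrected values belong to the pairs \(A,B\) and \(C,D\).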

NJ algorithm progressively builds up a tree \(T\) by keeping a list of active nodes \(L\) and finding the closest amongst them.

NJ schematic from wikipedia

Neighbour joining algorithm

1. Let \(T\) be the set of all leaf nodes and set \(L = T\).
2. Iterate until \(|L| = 2\):
   a. Calculate (or update) \(D\) from the distance matrix \(d\).
   b. Pick \(i,j \in L\) for which \(D_{ij}\) is minimal.
   c. Define a new node \(k\) with \(d_{km} = \frac12(d_{im} + d_{jm} - d_{ij})\) for all \(m \in L\).
   d. Add \(k\) to \(T\) with edges joining it to \(i\) and \(j\), with lengths \(d_{ik} =\frac12(d_{ij} + r_i - r_j)\) and \(d_{jk} = d_{ij} - d_{ik}.\)
   e. Set \(L = (L \setminus \{i,j\}) \cup \{k\}\).
3. When \(|L| = 2\), add the remaining edge connecting the two remaining nodes \(i,j\), with length \(d_{ij}\).
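The algorithm can be sketched as below. This is a direct, unoptimised version under stated assumptions: distances are supplied as a dict keyed by `frozenset({a, b})`, the tree is returned as a list of edges, and new internal nodes get the hypothetical labels `N0`, `N1`, ... (which must not clash with leaf names). The list `active` plays the role of \(L\).

```python
from itertools import combinations


def neighbour_joining(names, d):
    """Return the NJ tree as a list of (node, node, branch_length) edges.

    d maps frozenset({a, b}) -> distance for every pair of leaf names.
    """
    dist = dict(d)        # working copy, extended with new internal nodes
    active = list(names)  # the active node list L
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        # r_i: sum of distances to all other active nodes, divided by |L| - 2.
        r = {
            i: sum(dist[frozenset((i, m))] for m in active if m != i) / (n - 2)
            for i in active
        }
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(
            combinations(active, 2),
            key=lambda p: dist[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ij = dist[frozenset((i, j))]
        k = f"N{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        # Branch lengths from i and j to the new node k.
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        # Replace i and j by k, with distances from k to every remaining node.
        active = [m for m in active if m not in (i, j)]
        for m in active:
            dist[frozenset((k, m))] = 0.5 * (
                dist[frozenset((i, m))] + dist[frozenset((j, m))] - d_ij
            )
        active.append(k)
    # Two nodes left: join them with the remaining edge.
    i, j = active
    edges.append((i, j, dist[frozenset((i, j))]))
    return edges


names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 0.6, ("A", "C"): 0.8, ("A", "D"): 1.2,
         ("B", "C"): 0.4, ("B", "D"): 0.8, ("C", "D"): 0.6}
d = {frozenset(p): v for p, v in pairs.items()}
edges = neighbour_joining(names, d)
for u, v, length in edges:
    print(u, v, round(length, 3))
```

Ties in \(D\) may be broken differently from a hand calculation (floating-point rounding can decide them), so the order and labels of the edges can vary. The unrooted tree and its branch lengths do not.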

NJ example

\[ d = \begin{array}{ccccc} & A & B & C & D \\ A & 0 & 0.6 & 0.8 & 1.2 \\ B & & 0 & 0.4 & 0.8 \\ C & & & 0 & 0.6 \\ D & & & & 0 \end{array}\]

Need to calculate \[ D_{ij} = d_{ij} - (r_i + r_j) \mbox{ where } r_i = \frac1{|L| - 2} \sum_{k\in L\setminus \{i\}} d_{ik} \]

Here \(|L| = 4\), so
\[ r_A = \frac 1 2 (d_{AB} + d_{AC} + d_{AD}) = \frac 1 2 (0.6 + 0.8 + 1.2) = 2.6/2 = 1.3. \]

Others are similar to get \(r = (1.3, 0.9, 0.9, 1.3)\).

From \(r\) and \(d\) we can thus calculate \[ D= \begin{array}{ccccc} & A & B & C & D \\ A & &-1.6 &-1.4 &-1.4 \\ B & & &-1.4 &-1.4 \\ C & & & &-1.6\\ D & & && \end{array} \]

The minimum value, \(-1.6\), is attained by \(AB\) (and by \(CD\)).

Choose \(AB\) to merge into new node \(E\).

The length of the edge from \(A\) to \(E\) is \[d_{AE} = \frac12(d_{AB} + r_A - r_B) = \frac12(0.6 + 1.3 - 0.9 ) = 0.5\]

The length of the edge from \(B\) to \(E\) is \[ d_{BE} = d_{AB} - d_{AE} = 0.6 - 0.5 = 0.1\]

\(A\) and \(B\) can now be removed from the leaf set and replaced with \(E\) and a new rate adjusted matrix \(D\) derived.

Further iterations produce the original tree.
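As a check, the remaining steps can be worked through by hand. The label \(F\) for the second internal node is introduced here for illustration and is not part of the original example. The distances from \(E\) to the remaining leaves are \[ d_{EC} = \frac12(d_{AC} + d_{BC} - d_{AB}) = \frac12(0.8 + 0.4 - 0.6) = 0.3, \qquad d_{ED} = \frac12(d_{AD} + d_{BD} - d_{AB}) = \frac12(1.2 + 0.8 - 0.6) = 0.7. \]

Now \(|L| = 3\), so the divisor \(|L| - 2\) is 1 and \(r_E = 0.3 + 0.7 = 1.0\), \(r_C = 0.3 + 0.6 = 0.9\), \(r_D = 0.7 + 0.6 = 1.3\), giving \[ D_{EC} = 0.3 - 1.9 = -1.6, \qquad D_{ED} = 0.7 - 2.3 = -1.6, \qquad D_{CD} = 0.6 - 2.2 = -1.6. \]

All three values tie, as they always do with three nodes, and every choice leads to the same unrooted tree. Joining \(C\) and \(D\) into \(F\) gives \[ d_{CF} = \frac12(d_{CD} + r_C - r_D) = \frac12(0.6 + 0.9 - 1.3) = 0.1, \qquad d_{DF} = d_{CD} - d_{CF} = 0.5, \] and the final edge has length \[ d_{EF} = \frac12(d_{EC} + d_{ED} - d_{CD}) = \frac12(0.3 + 0.7 - 0.6) = 0.2. \]

Path lengths in the resulting tree reproduce \(d\): for example \(d_{AD} = 0.5 + 0.2 + 0.5 = 1.2\) and \(d_{BC} = 0.1 + 0.2 + 0.1 = 0.4\).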

Complexity of UPGMA and NJ

Both have space complexity of \(O(n^2)\).

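Time complexity for UPGMA is also \(O(n^2)\) with an efficient implementation (a naive implementation that rescans the whole matrix at every merge is \(O(n^3)\)).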

Time complexity for NJ is \(O(n^3)\)

Heuristics exist that speed up NJ and can, in many cases, do better.

Worst case complexity remains \(O(n^3)\).

See M. Simonsen, T. Mailund and C.N.S. Pedersen (2008). Rapid Neighbor-Joining for description of heuristics.

Rooted vs unrooted trees

NJ produces an unrooted tree: we don't know the direction of evolution.

An unrooted binary tree with \(n\) leaves has \(2n-3\) branches. The root could be along any branch.

To decide where the actual root lies, often include an outgroup — a taxon that is known to be distinct from others.

Formally, if the group of taxa \(T\) is being studied, an outgroup with respect to \(T\) is a taxon or a group of taxa that is related to \(T\) but such that the most recent common ancestor of \(T\) is not ancestral to any of the taxa in the outgroup.

Number of trees

Number of unrooted binary trees on \(n\) labelled leaves: \[\frac{(2n-5)!}{2^{n-3}(n-3)!}\]

Number of rooted binary trees on \(n\) labelled leaves: \[\frac{(2n-3)!}{2^{n-2}(n-2)!}\]

##  leaves        nUnrootedTrees           nRootedTrees
##       3                     1                      3
##       4                     3                     15
##       5                    15                    105
##       6                   105                    945
##       7                   945                  10395
##       8                 10395                 135135
##       9                135135                2027025
##      10               2027025               34459425
##      11              34459425              654729075
##      12             654729075            13749310575
##      13           13749310575           316234143225
##      14          316234143225          7905853580625
##      15         7905853580625        213458046676875
##      16       213458046676875       6190283353629375
##      17      6190283353629375     191898783962510625
##      18    191898783962510625    6332659870762850625
##      19   6332659870762850625  221643095476699771875
##      20 221643095476699771875 8200794532637891559375
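The two counting formulas can be evaluated exactly with integer arithmetic. This sketch uses Python's arbitrary-precision integers (the function names are illustrative). Exactness matters here because double-precision floating point cannot represent every integer above \(2^{53} \approx 9 \times 10^{15}\), a size the counts pass at around 16 to 17 leaves.

```python
from math import factorial


def n_unrooted(n: int) -> int:
    """Number of unrooted binary trees on n >= 3 labelled leaves."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def n_rooted(n: int) -> int:
    """Number of rooted binary trees on n >= 2 labelled leaves."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 10, 20):
    print(n, n_unrooted(n), n_rooted(n))
```

A rooted tree on \(n\) leaves corresponds to an unrooted tree on \(n+1\) leaves (the extra leaf marks the root), so `n_rooted(n) == n_unrooted(n + 1)`.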

What is wrong with distance based methods?

Distance based methods use only a coarse summary of the data.

Reduce the full \(n \times L\) data matrix (\(n\) taxa, sequences of length \(L\)) to an \(n \times n\) distance matrix.

May produce a close-to-correct tree if data is well-behaved.

But:

Data is not usually well-behaved (rate variation, non-treelike signals, complex mutation)
There is no reliable measure of uncertainty (could use bootstrap but it doesn't work well)
The model is not explicit