23 May 2017
Hamming distance for an aligned pair \(x\), \(y\), \[ D_{xy} = \mbox{ number of places } x \mbox{ and } y \mbox{ differ.} \]
Normalise this by dividing by sequence length \(L\) to get \[ f_{xy} = \frac{ D_{xy}} L = \mbox{ fraction of sites at which } x \mbox{ and } y \mbox{ differ.}\]
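As a sketch (the function name is illustrative, not from the notes), the fractional difference can be computed directly from two aligned strings:

```python
def fraction_diff(x, y):
    """Fraction of aligned sites at which x and y differ."""
    assert len(x) == len(y), "sequences must be aligned to the same length"
    D = sum(a != b for a, b in zip(x, y))  # Hamming distance D_xy
    return D / len(x)

# Two aligned sequences differing at 2 of 8 sites:
print(fraction_diff("ACGTACGT", "ACGAACGA"))  # 0.25
```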
\(f\) captures evolutionary distance well for small distances, but it underestimates larger ones: repeated substitutions at the same site are counted only once, so \(f\) saturates instead of growing linearly with time.
Based on a model (Jukes–Cantor) in which mutations between all four bases are equally likely.
Corrects for the fact that unrelated sequences will agree at some sites simply due to chance.
\[d_{xy} = -\frac34 \log \left (1- \frac{4f_{xy}}3\right).\]
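A sketch of this correction as code (function name illustrative; only defined for \(f < 3/4\), where the logarithm's argument is positive):

```python
import math

def jukes_cantor(f):
    """Jukes-Cantor distance d from the fraction f of differing sites.
    Only valid for f < 0.75: as f -> 3/4 the distance diverges."""
    return -0.75 * math.log(1 - 4 * f / 3)

# Mild correction for small f, large correction near saturation:
for f in (0.05, 0.25, 0.50, 0.70):
    print(f, round(jukes_cantor(f), 4))
```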
Can use other mutation models to define other distances.
UPGMA: Unweighted Pair Group Method using arithmetic Averages
Basic idea: repeatedly merge the two closest clusters, taking the distance between clusters to be the arithmetic average of the distances between their members.
Does UPGMA reconstruct the correct tree? In theory: yes, so long as the distances are ultrametric.
Formally, distances are ultrametric when, for all points \(i,j,k\), the distances \(d_{ij}, d_{jk}, d_{ik}\) are either all equal, or two are equal and the remaining one is smaller.
This means that the distances are tree-like and all leaves of tree have same distance to root.
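The three-point condition can be checked mechanically; a minimal sketch (helper name is illustrative, distances given as a symmetric dict-of-dicts):

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """True if every triple satisfies the three-point condition:
    the two largest of the three pairwise distances are equal."""
    for i, j, k in combinations(sorted(d), 3):
        smallest, mid, largest = sorted([d[i][j], d[j][k], d[i][k]])
        if abs(mid - largest) > tol:
            return False
    return True

# A and B at depth 1 below their common ancestor, C splitting off earlier:
d = {"A": {"A": 0, "B": 2, "C": 4},
     "B": {"A": 2, "B": 0, "C": 4},
     "C": {"A": 4, "B": 4, "C": 0}}
print(is_ultrametric(d))  # True: the two largest distances (4, 4) coincide
```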
Get ultrametric distances when sequences evolve according to a strict molecular clock:
Molecular clock means that mutations occur at a constant rate across whole tree.
In practice: distances are rarely ultrametric.
\[ d = \begin{array}{ccccc} & A & B & C & D \\ A & 0 & 0.6 & 0.8 & 1.2 \\ B & & 0 & 0.4 & 0.8 \\ C & & & 0 & 0.6 \\ D & & & & 0 \end{array}\]
In UPGMA, choose the pair with smallest distance first and join them: here, choose B and C, immediately leading to the wrong topology (shape).
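A sketch of this greedy first step (helper name illustrative): on the matrix above it joins \(B\) and \(C\), even though the true tree pairs \(A\) with \(B\) and \(C\) with \(D\):

```python
from itertools import combinations

def closest_pair(d):
    """Pair of leaves at the smallest distance (UPGMA's first merge)."""
    return min(combinations(sorted(d), 2), key=lambda p: d[p[0]][p[1]])

# The example distance matrix from the notes, as a symmetric dict:
d = {"A": {"B": 0.6, "C": 0.8, "D": 1.2},
     "B": {"A": 0.6, "C": 0.4, "D": 0.8},
     "C": {"A": 0.8, "B": 0.4, "D": 0.6},
     "D": {"A": 1.2, "B": 0.8, "C": 0.6}}
print(closest_pair(d))  # ('B', 'C') -- the wrong first merge
```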
A set of additive distances can be thought of as tree-like: there is a tree that correctly displays those distances as branch lengths.
Ultrametric distances are additive, but the reverse does not hold.
Formally, an additive tree satisfies the four point condition: any four leaves can be relabelled so that \(d(x,y) + d(u,v) \leq d(x,u)+ d(y,v) = d(x,v)+ d(y,u)\).
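As a sketch, the condition for one quadruple can be checked by trying all relabellings (function and variable names are illustrative; the distances are the example matrix from the notes):

```python
from itertools import permutations

def four_point_holds(d, quad, tol=1e-9):
    """True if some relabelling x,y,u,v of the quadruple satisfies
    d(x,y) + d(u,v) <= d(x,u) + d(y,v) = d(x,v) + d(y,u)."""
    for x, y, u, v in permutations(quad):
        small = d[x][y] + d[u][v]
        s1 = d[x][u] + d[y][v]
        s2 = d[x][v] + d[y][u]
        if abs(s1 - s2) <= tol and small <= s1 + tol:
            return True
    return False

d = {"A": {"B": 0.6, "C": 0.8, "D": 1.2},
     "B": {"A": 0.6, "C": 0.4, "D": 0.8},
     "C": {"A": 0.8, "B": 0.4, "D": 0.6},
     "D": {"A": 1.2, "B": 0.8, "C": 0.6}}
# 0.6 + 0.6 <= 0.8 + 0.8 = 1.2 + 0.4, so the condition holds:
print(four_point_holds(d, ("A", "B", "C", "D")))  # True
```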
Neighbour joining (NJ): given a set of additive distances, it reconstructs the correct tree.
Similar to UPGMA but instead of using simple distance matrix, forms the "rate-corrected" distance matrix before joining nearest neighbours.
Finds the pair that are true neighbours in the tree, instead of just the pair at the smallest distance.
Need to subtract the average distance to all other leaves.
Let \[ D_{ij} = d_{ij} - (r_i + r_j) \] where \[ r_i = \frac1{|L| - 2} \sum_{k\in L\setminus i} d_{ik} \] and \(L\) is the set of leaves (sequences).
It can be shown (proof omitted) that the pair of leaves \(i,j\) for which \(D_{ij}\) is minimal are neighbouring leaves.
NJ algorithm progressively builds up a tree \(T\) by keeping a list of active nodes \(L\) and finding the closest amongst them.
\[ d = \begin{array}{ccccc} & A & B & C & D \\ A & 0 & 0.6 & 0.8 & 1.2 \\ B & & 0 & 0.4 & 0.8 \\ C & & & 0 & 0.6 \\ D & & & & 0 \end{array}\]
Need to calculate \[ D_{ij} = d_{ij} - (r_i + r_j) \mbox{ where } r_i = \frac1{|L| - 2} \sum_{k\in L\setminus i} d_{ik} \]
Here \(|L| = 4\), so
\[ r_A = \frac 1 2 (d_{AB} + d_{AC} + d_{AD}) = \frac 1 2 (0.6 + 0.8 + 1.2) = 2.6/2 = 1.3. \]
Others are similar to get \(r = (1.3, 0.9, 0.9, 1.3)\).
From \(r\) and \(d\) we can thus calculate \[ D= \begin{array}{ccccc} & A & B & C & D \\ A & &-1.6 &-1.4 &-1.4 \\ B & & &-1.4 &-1.4 \\ C & & & &-1.6\\ D & & && \end{array} \]
The minimum value is attained at \(AB\) (and at \(CD\)).
Choose \(AB\) to merge into new node \(E\).
The length of the edge from \(A\) to \(E\) is \[d_{AE} = \frac12(d_{AB} + r_A - r_B) = \frac12(0.6 + 1.3 - 0.9 ) = 0.5\]
The length of the edge from \(B\) to \(E\) is \[ d_{BE} = d_{AB} - d_{AE} = 0.6 - 0.5 = 0.1.\]
\(A\) and \(B\) can now be removed from the leaf set and replaced with \(E\), and a new rate-corrected matrix \(D\) derived.
Further iterations produce the original tree.
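The iteration above can be sketched as a single function (a simplified illustration, not a full NJ implementation; names are hypothetical, distances given as a symmetric dict-of-dicts):

```python
def nj_step(d):
    """One neighbour-joining merge on a symmetric distance dict d.
    Returns the merged pair, its two edge lengths, and distances
    from the new node to the remaining leaves."""
    leaves = sorted(d)
    n = len(leaves)
    # rate terms r_i = (1 / (|L| - 2)) * sum over k of d_ik
    r = {i: sum(d[i][k] for k in leaves if k != i) / (n - 2) for i in leaves}
    # pick the pair minimising D_ij = d_ij - (r_i + r_j);
    # rounding the key hides float noise (here AB and CD tie at -1.6)
    i, j = min(((a, b) for a in leaves for b in leaves if a < b),
               key=lambda p: round(d[p[0]][p[1]] - r[p[0]] - r[p[1]], 9))
    d_ie = 0.5 * (d[i][j] + r[i] - r[j])  # edge from i to the new node
    d_je = d[i][j] - d_ie                 # edge from j to the new node
    d_new = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
             for k in leaves if k not in (i, j)}
    return (i, j), d_ie, d_je, d_new

d = {"A": {"B": 0.6, "C": 0.8, "D": 1.2},
     "B": {"A": 0.6, "C": 0.4, "D": 0.8},
     "C": {"A": 0.8, "B": 0.4, "D": 0.6},
     "D": {"A": 1.2, "B": 0.8, "C": 0.6}}
pair, d_ae, d_be, d_to_new = nj_step(d)
print(pair)                            # ('A', 'B')
print(round(d_ae, 6), round(d_be, 6))  # 0.5 0.1, as in the worked example
```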
Both have space complexity of \(O(n^2)\).
Time complexity for UPGMA is also \(O(n^2)\).
Time complexity for NJ is \(O(n^3)\).
Heuristics exist to speed up NJ and can, in many cases, do better.
Worst case complexity remains \(O(n^3)\).
See M. Simonsen, T. Mailund and C.N.S. Pedersen (2008), Rapid Neighbor-Joining, for a description of these heuristics.
NJ produces an unrooted tree: we don't know the direction of evolution.
An unrooted binary tree with \(n\) leaves has \(2n-3\) branches. The root could be along any branch.
To decide where the actual root lies, an outgroup is often included: a taxon known to be distinct from the others.
Formally, if a group of taxa \(T\) is being studied, an outgroup with respect to \(T\) is a taxon (or group of taxa) that is related to \(T\), but such that the most recent common ancestor of \(T\) is not ancestral to any taxon in the outgroup.
Number of unrooted trees: \[\frac{(2n-5)!}{2^{n-3}(n-3)!}\]
Number of rooted trees: \[\frac{(2n-3)!}{2^{n-2}(n-2)!}\]
## leaves        nUnrootedTrees          nRootedTrees
##      3                     1                     3
##      4                     3                    15
##      5                    15                   105
##      6                   105                   945
##      7                   945                 10395
##      8                 10395                135135
##      9                135135               2027025
##     10               2027025              34459425
##     11              34459425             654729075
##     12             654729075           13749310575
##     13           13749310575          316234143225
##     14          316234143225         7905853580625
##     15         7905853580625       213458046676875
##     16       213458046676875      6190283353629374
##     17      6190283353629374    191898783962510624
##     18    191898783962510624   6332659870762850304
##     19   6332659870762850304 221643095476699758592
##     20 221643095476699758592 8200794532637890838528
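These counts grow as double factorials, \((2n-5)!!\) and \((2n-3)!!\). A sketch with exact integer arithmetic (function names illustrative) reproduces them; note that the largest entries in the printed table above suffer from double-precision rounding, whereas the integer computation is exact:

```python
from math import factorial

def n_unrooted(n):
    """Exact count of unrooted binary trees on n labelled leaves, (2n-5)!!"""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def n_rooted(n):
    """Exact count of rooted binary trees on n labelled leaves, (2n-3)!!"""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (3, 10, 20):
    print(n, n_unrooted(n), n_rooted(n))
```

Rooting adds one extra branch choice, so the number of unrooted trees on \(n\) leaves equals the number of rooted trees on \(n-1\) leaves.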
Distance based methods use only a coarse summary of the data.
Reduce the full \(n \times L\) data matrix (\(n\) taxa, sequences of length \(L\)) to an \(n \times n\) distance matrix.
May produce a tree close to the correct one if the data are well-behaved.
But:
data are not usually well-behaved (rate variation, non-treelike signals, complex mutation processes)
no reliable measure of uncertainty (the bootstrap could be used, but it does not work well here)
the underlying model is not explicit