9 May 2017
Conceptually easy (recursion etc)
But resources required for naive implementation quickly get large:
Most widely used technique for MSA
Involves a series of pairwise alignments.
Basic idea: choose a pair of sequences and align them, choose a third and align it to either of first two, and so on until all sequences aligned.
May want to align two MSAs together: e.g. when aligning 4 sequences, might align two pairs first, then align the resulting alignments together.
Issues to address:
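UPGMA: Given objects (sequences) and distances between each pair of objects (a distance matrix), UPGMA builds a rooted tree in which the distance along the tree between a pair of objects matches the distance in the matrix when the distances are ultrametric, and approximates it otherwise.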
UPGMA: Unweighted Pair Group Method using Arithmetic Averages
Basic idea:
Given \(n\) objects and an \(n \times n\) distance matrix \(d = [d_{ij}]\), where \(d_{ij}\) is the distance between the \(i\)th and \(j\)th objects
Define distance between two clusters (sets) of objects \(C_i\) and \(C_j\) as the average distance between all pairs between clusters: \[ d_{ij} = \frac1{|C_i| |C_j|}\sum_{x \in C_i,y \in C_j} d_{xy}\] \(|C|\) is the number of sequences in cluster \(C\).
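The cluster distance above is just a mean over all cross-cluster pairs. A minimal sketch, assuming the pairwise distances are stored in a nested dictionary; the function name `cluster_distance` and the toy objects `P`, `Q`, `R` are hypothetical and chosen only for illustration:

```python
from itertools import product


def cluster_distance(ci, cj, d):
    """Average of d[x][y] over all x in ci and y in cj."""
    total = sum(d[x][y] for x, y in product(ci, cj))
    return total / (len(ci) * len(cj))


# Hypothetical toy distances between three objects.
d = {
    "P": {"Q": 2.0, "R": 6.0},
    "Q": {"P": 2.0, "R": 4.0},
    "R": {"P": 6.0, "Q": 4.0},
}

# Distance from cluster {P, Q} to cluster {R}: (6 + 4) / (2 * 1)
result = cluster_distance(["P", "Q"], ["R"], d)
```

Here `result` is the mean of the two cross-cluster distances.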
Initialise: Assign each sequence \(i\) to its own cluster \(C_i\). Assign a leaf node to each cluster at height 0.
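Repeat until only one cluster remains: find the pair of clusters \(C_i, C_j\) with the smallest distance \(d_{ij}\); merge them into a new cluster \(C_k = C_i \cup C_j\) and place its node at height \(d_{ij}/2\); then recompute the distances from \(C_k\) to all remaining clusters.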
Given 4 sequences, \(A,B,C\) and \(D\) and distance matrix \(d\), construct the UPGMA tree.
\[ d= \begin{array}{ccccc} & A& B&C&D \\ A &-&4 & 8 & 8 \\ B & & - & 8 & 8 \\ C & & & - & 6 \\ D &&&& - \end{array} \]
First step is to make 4 clusters, one per sequence, and assign each a leaf node at height 0.
Pair \((A,B)\) with distance \(d(A,B) = 4\) is closest.
Join them into the cluster \(E = A \cup B = \{A,B\}\), whose node has height \(d(A,B)/2 = 2\)
Recalculate distance matrix to find \(d(E,C)\) and \(d(E,D)\).
\(d(E,C) = \frac1{2\cdot 1}(d(A,C) + d(B,C)) = \frac 1 2 (8+8) = 8\)
and similarly for \(d(E,D)\). \[ \begin{array}{cccc} & E&C&D \\ E & - & 8 & 8 \\ C & & - & 6 \\ D &&& - \end{array} \]
Now form the cluster \(F = \{C,D\}\) and place the node at \(d(C,D)/2 = 3\).
The distance matrix is now the single distance between the remaining clusters: \(d(E,F) = \frac 1 {2\cdot 2} (d(A,C)+d(A,D) + d(B,C)+d(B,D) ) = \frac 1 4 (8+8+8+8) = 8\)
So make the last node \(G = \{E,F\}\) and place it at height \(8/2 = 4\).
Resulting UPGMA tree is \(((A,B),(C,D))\), with node \(E\) at height 2, node \(F\) at height 3 and root \(G\) at height 4.
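The worked example can be checked with a short sketch of UPGMA. This is an illustration under stated assumptions, not a reference implementation: the function name `upgma`, the `frozenset`-keyed distance dictionary, and the convention of labelling a merged cluster by concatenating its children's labels (so `E` appears as `"AB"`, `F` as `"CD"`, `G` as `"ABCD"`) are all choices made here. Leaf labels are assumed to be distinct strings that cannot collide with a concatenated label.

```python
from itertools import combinations


def upgma(names, leaf_dist):
    """Build a UPGMA tree.

    names: list of distinct leaf labels (strings).
    leaf_dist: dict mapping frozenset({a, b}) to the distance between leaves.
    Returns (tree, heights): tree is a nested tuple of leaf labels and
    heights maps each cluster label to the height of its node.
    """
    # Each cluster label maps to (number of leaves, subtree).
    clusters = {name: (1, name) for name in names}
    heights = {name: 0.0 for name in names}
    d = {frozenset(p): float(leaf_dist[frozenset(p)])
         for p in combinations(names, 2)}

    while len(clusters) > 1:
        # Closest pair of current clusters (ties go to the first pair found).
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        d_ab = d[frozenset((a, b))]
        size_a, tree_a = clusters.pop(a)
        size_b, tree_b = clusters.pop(b)
        new = a + b  # label of the merged cluster, e.g. "AB"

        # Size-weighted mean of the old distances, which equals the
        # average over all leaf pairs between the two clusters.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)

        clusters[new] = (size_a + size_b, (tree_a, tree_b))
        heights[new] = d_ab / 2

    tree = next(iter(clusters.values()))[1]
    return tree, heights


# Distance matrix from the worked example.
names = ["A", "B", "C", "D"]
leaf_dist = {
    frozenset("AB"): 4, frozenset("AC"): 8, frozenset("AD"): 8,
    frozenset("BC"): 8, frozenset("BD"): 8, frozenset("CD"): 6,
}
tree, heights = upgma(names, leaf_dist)
```

For this matrix the sketch joins \(A\) with \(B\), then \(C\) with \(D\), then the two pairs, with node heights 2, 3 and 4, matching the hand calculation. Updating distances with a size-weighted mean avoids re-summing over all leaf pairs after every merge.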
Steps for aligning \(n\) sequences are:
The distances are found by aligning each pair and recording a normalized score.
The normalized score is \[D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}\] where
\(S_{eff}\) is roughly a normalised percentage similarity which decays exponentially towards zero with increasing evolutionary distance.
Taking \(-\log S_{eff}\) makes the score increase approximately linearly with evolutionary distance.
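The normalised score is straightforward to compute once the three raw scores are known. A minimal sketch, assuming the scores are plain numbers; the function name `normalised_distance` and the example values are hypothetical, and the guard against non-positive \(S_{eff}\) is an added assumption, since the logarithm is undefined there:

```python
import math


def normalised_distance(s_obs, s_rand, s_max):
    """Return D = -log(S_eff), with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Observed score is no better than random: the distance is undefined.
        raise ValueError("S_eff must be positive to take the logarithm")
    return -math.log(s_eff)


# Hypothetical scores: S_eff = (60 - 20) / (100 - 20) = 0.5, so D = log 2.
example = normalised_distance(s_obs=60, s_rand=20, s_max=100)
```

When \(S_{obs} = S_{max}\) the distance is 0, and it grows without bound as \(S_{obs}\) approaches \(S_{rand}\).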
After the alignment at each step is completed, gap characters are replaced with a neutral \(X\) character which can be aligned to any other character (gap or residue) at no cost. This tends to cluster gaps together.
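The gap-freezing step can be sketched as a string replacement plus a scoring rule that treats \(X\) as free. The function names `freeze_gaps` and `pair_score`, the `-` gap symbol, and the match, mismatch and gap scores are all assumptions made for illustration; the notes do not specify a scoring scheme.

```python
def freeze_gaps(alignment, gap="-", neutral="X"):
    """Replace every gap character in an alignment with the neutral character."""
    return [row.replace(gap, neutral) for row in alignment]


def pair_score(a, b, match=1, mismatch=-1, gap_penalty=-2,
               gap="-", neutral="X"):
    """Score one aligned pair of characters; the neutral character costs nothing."""
    if neutral in (a, b):
        return 0  # X aligns to a residue or a gap at no cost
    if gap in (a, b):
        return gap_penalty
    return match if a == b else mismatch


# Hypothetical two-row alignment produced at an earlier step.
frozen = freeze_gaps(["AC-GT", "A-CGT"])
```

Because a new gap placed opposite an existing \(X\) scores 0 rather than the gap penalty, later alignments are nudged towards putting gaps where earlier ones already are.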
Can improve initial MSA by making random adjustments, e.g. remove a sequence at random and re-align it.