CS 369: Multiple sequence alignment

9 May 2017

Multiple sequence alignment

MSA via dynamic programming

Conceptually easy (recusion etc)
But resources required for naive implementation quickly get large:

Needleman-Wunsch is \(O(n^2)\) in space and time at worst

Extending to \(k\) sequences of length \(n\) requires storing and calculating matrix of size \(n^k\).

eg 5 seqeunces of of length 100 produces array with \(10^5 = 10^{10}\) cells, 4 bytes per cell means need approx 37GB just to store

Real examples may involve 1000s of sequences each with 1000s of sites.

Some clever approaches (calculating upper and lower bounds on score) can limit size of space searched but only work for small problems (say \(5-10\) seqs of \(<500\) bases)

Give up on finding optimal alignment and try heuristics to find good enough alignment.

MSA via progressive alignment

Most widely used technique for MSA

Involves a series of pairwise alignments.

Basic idea: choose a pair of sequences and align them, choose a third and align it to either of first two, and so on until all sequences aligned.

May want to align two MSA together: eg when aligning 4 sequences, might align two pairs first then align the resulting alignments together.

MSA via progressive alignment

Issues to address:

Order in which to align the sequences
How to align two sequences together
How to align a sequence to a MSA
How to align two MSAs together

Building trees with UPGMA

UPGMA: Given objects (sequences) and distances between each pair of objects (a distance matrix), UPGMA builds a tree where distance along tree between a pair is same as distance in matrix.

UPGMA: Unweighted Pair Group Method using Arithmetic Averages

Basic idea:

Start with each object in own cluster and assign leaf node to each cluster
Merge closest cluster together and assign (internal) node at height half the distance
if number of clusters is greater than two, go to 2., else stop.

UPGMA

Given \(n\) objects and \(n \times n\) distnace matrix \(d = [d_{ij}]\) where \(d_{ij}\) is distance between \(i\) and \(j\) th object

Define distance between two clusters (sets) of objects \(C_i\) and \(C_j\) as the average distance between all pairs between clusters: \[ d_{ij} = \frac1{|C_i| |C_j|}\sum_{x \in C_i,y \in C_j} d_{xy}\] \(|C|\) is the number of sequences in cluster \(C\).

UPGMA algorithm

Initialise: Assign each sequence \(i\) to it's own cluster \(C_i\). Assign a leaf node to each cluster at height 0.

Repeat until only one cluster remains:

Find clusters \(C_i\) and \(C_j\) such that \(d_{ij}\) is minimal (choose randomly between equidistant candidates).
Join \(i\) and \(j\) to make the new cluster \(C_k = C_i \cup C_j\).
Define a node \(k\) in the tree placed at height \(d_{ij}/2\) with child nodes \(i\) and \(j\).
Update the distance matrix using \[ d_{kl} = \frac1{|C_k| |C_l|}\sum_{x \in C_k,y \in C_l} d_{xy}\]

UPGMA example

Given 4 sequences, \(A,B,C\) and \(D\) and distance matrix \(d\), construct the UPGMA tree.

\[ d= \begin{array}{ccccc} & A& B&C&D \\ A &-&4 & 8 & 8 \\ B & & - & 8 & 8 \\ C & & & - & 6 \\ D &&&& - \end{array} \]

First step is to make 4 clusters and assign each a node at 0

Pair \((A,B)\) with distance \(d(A,B) = 4\) is closest.

Join the cluster \(E = A \cup B = \{A,B\}\) which has height \(d(A,B)/2 = 2\)

Recalculate distance matrix to find \(d(E,C)\) and \(d(E,D)\).

\(d(E,C) = \frac1{2\cdot 1}(d(A,C) + d(B,C)) = \frac 1 2 (8+8) = 8\)

and similarly for \(d(E,D)\). \[ \begin{array}{cccc} & E&C&D \\ E & - & 8 & 8 \\ C & & - & 6 \\ D &&& - \end{array} \]

Now form the cluster \(F = \{C,D\}\) and place the node at \(d(C,D)/2 = 3\).

The distance matrix is now the single distance between the remaining clusters: \(d(E,F) = \frac 1 {2\cdot 2} (d(A,C)+d(A,D) + d(B,C)+d(B,D) ) = \frac 1 4 (8+8+8+8) = 8\)

So make the last node \(G = \{E,F\}\) and place it at height \(8/2 = 4\).

Resulting UPGMA tree is

Feng-Doolittle progressive alignment

Steps for aligning \(n\) sequences are:

Calculate the \(n(n-1)/2\) distances between all sequences pairs.
Build a guide tree based on the recorded scores (we use UPGMA).
Build the alignment in the order that nodes were added to the tree.

Distance used in Feng-Doolittle algorithm

The distances are found by aligning each pair and recording a normalized score.

The normalized score is \[D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}}\] where

\(S_{obs}\) is the score from the pairwise alignment
\(S_{max}\) is the average of scores obtained by aligning each sequences of the pair to itself
\(S_{rand}\) is the expected score for an alignment of the pair when the residues are randomly shuffled.

\(S_{eff}\) is roughly a normalised percentage similarity which decays exponentially towards zero with increasing evolutionary distance.

Taking \(-\log(S_{eff})\) tomake the score increase approximately linearly with evolutionary distance.

Aligning sequences and MSAs in Feng-Doolittle

A pair of sequences is aligned in the normal way.
A sequence is aligned to an MSA by aligning it to each sequence in the MSA and choosing the highest scoring alignment.
Two alignments are aligned to each other by aligning all pairs of sequences between the two groups and choosing the best alignment.

After alignment completed at each step, gap characters are replaced with a neutral \(X\) character which can be aligned to any other character (gap or residue) with no cost. Tends to cluster gaps together.

Can improve inital MSA by making random adjustments, eg remove a sequence at random and re-align it.