CS 369: More on pairwise alignment

5 May 2017

Elements of alignment algorithm

a recurrence relation for the quantity we are trying to optimise
boundary conditions
tabular computing to efficiently calculate the recurrence
traceback, includes rules for where to start and stop traceback.

Local Alignment

Local alignment: Smith Waterman algorithm

Want to find best alignment for subsequences of \(x\) and \(y\).

Fix \(i\leq m\) and \(j \leq m\).

Define a Suffix alignment: any alignment of \[x_s x_{s+1} \ldots x_i\] and \[y_r y_{r+1} \ldots y_j\] for some \(1 \leq s \leq i\) and \(1 \leq r \leq j\).

Let \(F(i,j) =\) highest score of any suffix alignment of \(x_1 x_2 \ldots x_i\) and \(y_1 y_2 \ldots y_j\)

Since score of empty alignment is 0, \(F(i,j) \geq 0\)

So instead of a letting a score for an alignment to go negative, we start a new alignment.

\[ F(i,j) = \max \begin{cases} 0\\ F(i-1,j-1)+s(x_i,y_i), \\ F(i-1,j)-d, \\ F(i,j-1)-d. \end{cases} \]

Traceback Starts at highest value in \(F\) matrix, trace it back until we hit first 0 with no predecesor.

Local alignment example

Find best local alignment of x = ATA and y = AGTTA with same gap penalty and score matrix as before.

Want to fill out

Highest score is at \(F(3,5)\), so start traceback from there and stop when hit first zero to get \[ \begin{array}{cc} T & A \\ T & A \end{array}\]

Overlap alignment

Often have seqeunces x and y are related like:

y = --------------------ATACTGGGGCACTGGGTACA
x = AAGTAAACCGCATAACGAGTATACTAGGGC----------

y = GGCGGTTAAATGCCGATCAACAGAGTACGTATCTT-----
x = ----------------------GAGTACGTATCTTGTCGG

y = ------TCACCGGACGTGGGTCCTGGATTTT---------
x = AAGCACTCACGGGACGAGGGTCCTGGGTTTTGCCGATTCG

y = AAGCACTCACGGGACGAGGGTCCTGGGTTTTGCCGATTCG
x = -----GTCACGGGACGTGGGTCCTGGGA------------

Want a global alignment algorithm that doesn't penalise the preceding or succeeding gaps.

Overlap alignment

Use global aligment algorithm and just change boundary conditions and traceback:

Set \(F(i,0) = F(0,j) = 0\) for all \(i\) and \(j\) so that overlapping ends on the left are not penalised.
Use the same recurrence relation as for the Needleman-Wunsch algorithm
Start traceback at maximum value on bottom/right boundaries, \(F(n,j)\) or \(F(i,m)\), some \(i\) or \(j\).
Stop traceback when top or right boundary is reached.

More complex gap penalities

Linear gap penalty results in an unrealisitc distribution of gaps.

With an artibtrary gap penalty \(\gamma(k)\), get global recurrence relation:

\[ F_{i,j}= \max \begin{cases} F(i-1,j-1)+s(x_i,y_j), \\ F(k,j) + \gamma(i - k), \hspace{1cm} k = 0,\ldots,i-1, \\ F(i,k) + \gamma(j-k),\hspace{1cm} k = 0,\ldots,j-1 . \end{cases} \]

More complex gap penalities

\[ F_{i,j}= \max \begin{cases} F(i-1,j-1)+s(x_i,y_j), \\ F(k,j) + \gamma(i - k), \hspace{1cm} k = 0,\ldots,i-1, \\ F(i,k) + \gamma(j-k),\hspace{1cm} k = 0,\ldots,j-1 . \end{cases} \]

For \(F(i,j)\) need to consider \(i+j+1\) other cells: diagonal + everything same row and column. Result is \(O(n^3)\), not \(O(n^2)\) in speed

Affine gap alignment: \(\gamma(k) = -d -(k-1)e\)

A \(O(n^2)\) approach is available here but need to use more space.

\(M(i,j)\) is score of best alignment up to \((i,j)\) given that \(x_i\) is aligned to \(y_j\).
```
A C C xi
A C G yj
```
\(I_x(i,j)\) is score of best alignment up to \((i,j)\) given that \(x_i\) is aligned to a gap.
```
A C  C xi
G yj - -
```
\(I_y(i,j)\) is score of best alignment up to \((i,j)\) given that \(y_j\) is aligned to a gap.
```
A C xi  -
A C G  yj
```

Assume a gap never directly follows an insertion (so can't go direct \(I_x \leftrightarrow I_y\))

\[ M(i,j)= \max \begin{cases} M(i-1,j-1)+s(x_i,y_j), \\ I_x(i-1,j-1)+s(x_i,y_j), \\ I_y(i-1,j-1)+s(x_i,y_j); \end{cases} \] \[ I_x(i,j)= \max \begin{cases} M(i-1,j)-d, \\ I_x(i-1,j)-e; \end{cases} \mbox{and}, \] \[ I_y(i,j)= \max \begin{cases} M(i,j-1)-d, \\ I_y(i,j-1)-e. \end{cases} \]

Can again apply tabular computing this time using 3 arrays (for \(M, \, I_x\) and \(I_y\)) and traceback as before.

Algorithm is still \(O(n^2)\) in time and space but with higher coefficents (as need to store and calculate 3 arrays not 1).

Affine alignment as a finite state automaton

States \(M\), \(I_x\) and \(I_y\) shown as cirlces.

Transitions betwen states via arrows and incur cost shown next to arrow.

Each state increments position on sequences \(x\) and \(y\) by amount shown

An alignment is a path through the states

V  L  S  P  A  D  -  K
H  L  -  -  A  E  S  K

M  M  Ix Ix M  M  Iy M

Finite state automata are known as Moore machines or Mealy Machines in Computer Science.

Markov chains and Hidden Markov Models (coming soon…) are stochastic FSAs

Finite state automata are known as Moore machines or Mealy Machines in Computer Science.

Markov chains and Hidden Markov Models (coming soon…) are stochastic FSAs

Complexity of pairwise alignment algorithms

Naively, quadratic in time and space: \(O(mn)\).

If only need score but not alignment itself, easy to do with linear space: \(O(m)\) (still \(O(mn)\) time)

With a little work, can get alignment in linear space too. Details in notes.

Pairwise alignment in practice

If \(m\) and \(n\) not too large, we use these algorithms.

Commonly, have sequence \(x\) and want to find matches in whole database. So \(m\) small but \(n\) very large. Need to use approximations to the exact method.

BLAST (Basic Local Alignment Search Tool) is the most common. See https://blast.ncbi.nlm.nih.gov/Blast.cgi