- a recurrence relation for the quantity we are trying to optimise
- boundary conditions
- tabular computing to efficiently calculate the recurrence
- traceback, includes rules for where to start and stop traceback.
5 May 2017
Want to find best alignment for subsequences of \(x\) and \(y\).
Fix \(i\leq m\) and \(j \leq m\).
Define a Suffix alignment: any alignment of \[x_s x_{s+1} \ldots x_i\] and \[y_r y_{r+1} \ldots y_j\] for some \(1 \leq s \leq i\) and \(1 \leq r \leq j\).
Let \(F(i,j) =\) highest score of any suffix alignment of \(x_1 x_2 \ldots x_i\) and \(y_1 y_2 \ldots y_j\)
Since score of empty alignment is 0, \(F(i,j) \geq 0\)
So instead of a letting a score for an alignment to go negative, we start a new alignment.
\[ F(i,j) = \max \begin{cases} 0\\ F(i-1,j-1)+s(x_i,y_i), \\ F(i-1,j)-d, \\ F(i,j-1)-d. \end{cases} \]
Find best local alignment of x = ATA and y = AGTTA with same gap penalty and score matrix as before.
Want to fill out
Highest score is at \(F(3,5)\), so start traceback from there and stop when hit first zero to get \[ \begin{array}{cc} T & A \\ T & A \end{array}\]
Often have seqeunces x and y are related like:
y = --------------------ATACTGGGGCACTGGGTACA x = AAGTAAACCGCATAACGAGTATACTAGGGC----------
y = GGCGGTTAAATGCCGATCAACAGAGTACGTATCTT----- x = ----------------------GAGTACGTATCTTGTCGG
y = ------TCACCGGACGTGGGTCCTGGATTTT--------- x = AAGCACTCACGGGACGAGGGTCCTGGGTTTTGCCGATTCG
y = AAGCACTCACGGGACGAGGGTCCTGGGTTTTGCCGATTCG x = -----GTCACGGGACGTGGGTCCTGGGA------------
Want a global alignment algorithm that doesn't penalise the preceding or succeeding gaps.
Use global aligment algorithm and just change boundary conditions and traceback:
Linear gap penalty results in an unrealisitc distribution of gaps.
With an artibtrary gap penalty \(\gamma(k)\), get global recurrence relation:
\[ F_{i,j}= \max \begin{cases} F(i-1,j-1)+s(x_i,y_j), \\ F(k,j) + \gamma(i - k), \hspace{1cm} k = 0,\ldots,i-1, \\ F(i,k) + \gamma(j-k),\hspace{1cm} k = 0,\ldots,j-1 . \end{cases} \]
\[ F_{i,j}= \max \begin{cases} F(i-1,j-1)+s(x_i,y_j), \\ F(k,j) + \gamma(i - k), \hspace{1cm} k = 0,\ldots,i-1, \\ F(i,k) + \gamma(j-k),\hspace{1cm} k = 0,\ldots,j-1 . \end{cases} \]
For \(F(i,j)\) need to consider \(i+j+1\) other cells: diagonal + everything same row and column. Result is \(O(n^3)\), not \(O(n^2)\) in speed
A \(O(n^2)\) approach is available here but need to use more space.
\(M(i,j)\) is score of best alignment up to \((i,j)\) given that \(x_i\) is aligned to \(y_j\).
A C C xi A C G yj
\(I_x(i,j)\) is score of best alignment up to \((i,j)\) given that \(x_i\) is aligned to a gap.
A C C xi G yj - -
\(I_y(i,j)\) is score of best alignment up to \((i,j)\) given that \(y_j\) is aligned to a gap.
A C xi - A C G yj
Assume a gap never directly follows an insertion (so can't go direct \(I_x \leftrightarrow I_y\))
\[ M(i,j)= \max \begin{cases} M(i-1,j-1)+s(x_i,y_j), \\ I_x(i-1,j-1)+s(x_i,y_j), \\ I_y(i-1,j-1)+s(x_i,y_j); \end{cases} \] \[ I_x(i,j)= \max \begin{cases} M(i-1,j)-d, \\ I_x(i-1,j)-e; \end{cases} \mbox{and}, \] \[ I_y(i,j)= \max \begin{cases} M(i,j-1)-d, \\ I_y(i,j-1)-e. \end{cases} \]
Can again apply tabular computing this time using 3 arrays (for \(M, \, I_x\) and \(I_y\)) and traceback as before.
Algorithm is still \(O(n^2)\) in time and space but with higher coefficents (as need to store and calculate 3 arrays not 1).
States \(M\), \(I_x\) and \(I_y\) shown as cirlces.
Transitions betwen states via arrows and incur cost shown next to arrow.
Each state increments position on sequences \(x\) and \(y\) by amount shown
An alignment is a path through the states
V L S P A D - K H L - - A E S K M M Ix Ix M M Iy M
Finite state automata are known as Moore machines or Mealy Machines in Computer Science.
Markov chains and Hidden Markov Models (coming soon…) are stochastic FSAs
Finite state automata are known as Moore machines or Mealy Machines in Computer Science.
Markov chains and Hidden Markov Models (coming soon…) are stochastic FSAs
Naively, quadratic in time and space: \(O(mn)\).
If only need score but not alignment itself, easy to do with linear space: \(O(m)\) (still \(O(mn)\) time)
With a little work, can get alignment in linear space too. Details in notes.
If \(m\) and \(n\) not too large, we use these algorithms.
Commonly, have sequence \(x\) and want to find matches in whole database. So \(m\) small but \(n\) very large. Need to use approximations to the exact method.
BLAST (Basic Local Alignment Search Tool) is the most common. See https://blast.ncbi.nlm.nih.gov/Blast.cgi