12 May 2017

Viterbi algorithm (log units)

Initialisation \(i = 0\): \(V_0(0) = 0\), \(V_k(0) = -\infty\) for \(k>0\).

Recursion \(i = 1,\ldots,L\): \(V_l(i) = E_l(x_i) + \max_k(V_k(i-1) + A_{kl})\).

\(\mathrm{Pointer}_i(l) = \arg \max_k(V_k(i-1) + A_{kl})\)

Termination: \(P(x,\pi^*) = \max_k (V_k(L) + A_{k0})\).

Traceback \(i = L,\ldots,1\): \(\pi^*_L = \arg \max_k(V_k(L) + A_{k0})\).

\(\pi_{i-1}^* = \mathrm{Pointer}_i(\pi_i^*)\).

The key part of the algorithm is calculating the recursion \[V_l(i+1) = E_l(x_{i+1}) + \max_k (V_k(i)+ A_{kl}) \]

Think of \(V_k(i) = V(k,i)\) as a matrix. Row indices are states, column indices are positions along sequence.

Viterbi example: CpG islands

Methylation changes the nucleotide pair CG (written CpG) to TG, so CpG is rarer than expected in genome.

But in functional regions, methylation is less common so we see more CpGs there. These regions are known as CpG islands.

Model each part of a sequence as having either high or low CpG content.

Label the regions \(H\) (for high CpG content) and \(L\) (for low).

Viterbi example: CpG islands

In high CpG regions, C and G each occur with probability 0.35 and T and A with probability 0.15.

In low CpG regions, C and G are probability 0.2 and T, A are 0.3.

The initial state is H with probability 0.5 and L with probability 0.5.

Transitions are given by \(a_{HH} = a_{HL} = 0.5, \, a_{LL} = 0.6, a_{LH} = 0.4\).

Find the most likely state sequence given the sequence of symbols \(x = GGCACTGAA\).

Viterbi example: CpG islands

Work in log units (here we use log base 2; any base works, and base \(e\) is usually easiest):

set \(\mathbf A = \log_2(a)\) and \(\mathbf E = \log_2(e)\) \[\mathbf A = \begin{array}{ccc} & H & L \\ H & \log(a_{HH}) = -1 & -1 \\ L & -1.322 & -0.737 \end{array} \]

\[\mathbf E = \begin{array}{ccccc} & A & C&G&T\\ H & \log(e_H(A)) = -2.737 & -1.515 & -1.515 & -2.737 \\ L & -1.737 & -2.322 & -2.322 & -1.737 \end{array} \]

Recursion is: \(V_l(i) = E_l(x_i) + \max_k(V_k(i-1) + A_{kl})\)

Fill out the matrix \(V(k,i)\) column by column using \[V_l(i) = E_l(x_i) + \max_k(V_k(i-1) + A_{kl}),\] recording the maximising \(k\) as the pointer at each step.

Viterbi example: CpG islands

Traceback begins in the final column by finding the state that maximises the joint probability, \(L\) in this case.

Following the pointers back from this position and recording the state at each step gives the state path with the highest probability: \[\pi^\ast = HHHLLLLLL.\]
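The whole procedure can be run end to end in a few lines. The notes' own calculations appear to use R; here is a Python sketch with the model numbers from this example (the function and variable names such as `viterbi`, `trans`, `emit` are mine, not from the notes):

```python
from math import log2

# CpG model from the notes: states H (high CpG) and L (low CpG)
states = "HL"
init = {"H": 0.5, "L": 0.5}                      # initial probabilities a_{0H}, a_{0L}
trans = {"HH": 0.5, "HL": 0.5, "LH": 0.4, "LL": 0.6}
emit = {"H": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "L": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def viterbi(seq):
    # V[-1][k] = best log2 joint probability of the prefix ending in state k
    V = [{k: log2(init[k]) + log2(emit[k][seq[0]]) for k in states}]
    ptr = []
    for x in seq[1:]:
        col, back = {}, {}
        for l in states:
            # recursion: V_l(i) = E_l(x_i) + max_k(V_k(i-1) + A_{kl})
            best = max(states, key=lambda k: V[-1][k] + log2(trans[k + l]))
            col[l] = log2(emit[l][x]) + V[-1][best] + log2(trans[best + l])
            back[l] = best
        V.append(col)
        ptr.append(back)
    # traceback from the best final state (no end state, so A_{k0} = 0)
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return "".join(reversed(path)), V[-1][last]

path, logp = viterbi("GGCACTGAA")
print(path, round(logp, 5))   # HHHLLLLLL -23.82027
```

The returned log probability matches the bottom-right entry of the Viterbi matrix shown later in the notes.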

Viterbi algorithm examples

Recall the cheating casino which had a fair die (F) and a loaded die (L). The loaded die rolls a six 50% of the time.

Rolls (symbol sequence)
2 5 1 6 6 2 4 2 4 5 6 6 3 6 2 4 2 2 3 4 6 3 6 5 3 4 5 2 3 5 1 6 6 6 6 2 4 6 6 2 6 6 6 6 6 1 5 1 6 4 1 2
Reconstructed state sequence using the Viterbi algorithm
F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F L L L L L L L L L L L L L L F F F F F F F
States (true state sequence)
F F F F L F F F L L L L L L L L F F F F F F F F F F F F F F L L L L F F F F L L L L L L F F F F F F F F

Automatic Chord Recognition – Chordec

Calculating the probability of x

If \(x\) and \(\pi\) known, easy to calculate \(P(x,\pi)\).

But only observe \(x\), not \(\pi\) so what do we do?

\(P(x,\pi^*)\) is not much use on its own, as there may be very many possible state paths \(\pi\)

Want to marginalise over the state path, that is, calculate \[P(x) = \sum_{\pi} P(x,\pi)\]

But this sum contains a huge number of terms (\(K^L\) if \(x\) has length \(L\) and there are \(K\) states)

Dynamic programming to the rescue…
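For the tiny CpG example above the brute-force sum is still feasible (\(2^9 = 512\) paths), which makes a useful sanity check for the dynamic-programming version. A Python sketch (not the notes' R code; the name `brute_force_px` is mine):

```python
from itertools import product

# CpG model from the notes
init = {"H": 0.5, "L": 0.5}
trans = {"HH": 0.5, "HL": 0.5, "LH": 0.4, "LL": 0.6}
emit = {"H": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "L": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def brute_force_px(seq):
    # P(x) = sum over all K^L state paths of P(x, pi) -- only viable for tiny L
    total = 0.0
    for path in product("HL", repeat=len(seq)):
        p = init[path[0]] * emit[path[0]][seq[0]]
        for prev, cur, x in zip(path, path[1:], seq[1:]):
            p *= trans[prev + cur] * emit[cur][x]
        total += p
    return total

print(brute_force_px("GGCACTGAA"))   # ~4.066495e-06, summed over 512 paths
```

This agrees with the forward-algorithm result computed later in the notes, but the \(K^L\) cost makes it useless beyond toy sequences.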

Forward algorithm

Let \[f_k(i) = P(x_{1:i},\pi_i = k)\]

So \(f_k(i)\) is the joint probability of the first \(i\) observations and the path being in state \(k\) at the \(i\)th step.

Recursion for \(f\): \[f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)a_{kl}\]

Always start in state 0 at step 0, so \(f_0(0) = 1\) and \(f_k(0) = 0 \mbox{ for } k \neq 0.\)

Forward algorithm

Initialisation \(i = 0\): \(f_0(0) = 1\), \(f_k(0) = 0\) for \(k>0\).

Recursion \(i = 1,\ldots,L\): \(f_l(i) = e_l(x_i) \sum_k f_k(i-1)a_{kl}\).

Termination: \(P(x) = \sum_k f_k(L)a_{k0}\).

If the end state \(0\) is not modelled, set \(a_{k0} = 1\) to get \[P(x) = \sum_k f_k(L).\]
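The transcript below calls a function `forwardNoLog(seq,a,e)` whose body is not shown; a possible Python equivalent of the steps above (a sketch, with my own names, not the notes' actual code):

```python
# CpG model from the notes
states = "HL"
init = {"H": 0.5, "L": 0.5}
trans = {"HH": 0.5, "HL": 0.5, "LH": 0.4, "LL": 0.6}
emit = {"H": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "L": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def forward(seq):
    # f[k] = P(x_{1:i}, pi_i = k), updated column by column
    f = {k: init[k] * emit[k][seq[0]] for k in states}
    for x in seq[1:]:
        f = {l: emit[l][x] * sum(f[k] * trans[k + l] for k in states)
             for l in states}
    return sum(f.values())          # end state not modelled, so a_{k0} = 1

print(forward("GGCACTGAA"))         # ~4.066495e-06
```

Note only the previous column of \(f\) is needed at each step, so the memory cost can be \(O(K)\) rather than \(O(KL)\) if the full matrix is not required.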

Think of \(f_k(i)\) as the matrix \(f(k,i)\) and calculate column by column:

Forward algorithm in log units

As always, prefer to work in log units.

Using the notation \(A_{kl} = \log a_{kl}\) etc., the main recursion in log units is

\[ F_l(i) = \log \left[ e_l(x_i) \sum_k f_k(i-1)a_{kl}\right] = E_l(x_i) + \log \left[\sum_k f_k(i-1)a_{kl}\right] \] But we can't take the log through the sum, so if we only store \(F\) and \(A\) we could calculate the RHS as \[ E_l(x_i) + \log \left[\sum_k \exp(F_k(i-1) + A_{kl})\right]\] but this involves forming the exponential of a large negative number, which is exactly what we want to avoid.

Log sum

Want \(\log(e^a + e^b)\) without forming \(e^a\) or \(e^b\).

Handy trick: \[ \log(e^a + e^b) = \log(e^a(1 + e^{b-a}) )\] \[ = \log(e^a) + \log(1 + e^{b-a}) = a + \log(1 + e^{b-a}).\]

Taking \(a\) to be the larger of the two ensures \(e^{b-a} \le 1\), so we never need to store an extremely large number and the computation is stable.

Call this the logsum of \(a\) and \(b\).
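In code the trick is a one-liner; a Python sketch (the name `logsum2` is mine), using `log1p(x) = log(1 + x)` which is accurate when \(x\) is small:

```python
from math import exp, log1p

def logsum2(a, b):
    # log(e^a + e^b) without ever forming e^a or e^b
    if a < b:
        a, b = b, a                 # make the shifted exponent b - a <= 0
    return a + log1p(exp(b - a))

# naive log(exp(a) + exp(b)) underflows to log(0) here, but the trick is fine:
print(logsum2(-1000.0, -1001.0))    # ~ -999.6867
```

With `a = -1000` and `b = -1001`, `exp(-1000)` underflows to zero in double precision, yet the shifted form only ever evaluates `exp(-1)`.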

Log sum

Extends to finding the log of a sum of multiple logged numbers:

function logsum(x)
  m = max(x)          # shift by the largest element so every exponential is <= 1
  z = 0
  for i in x
    z += exp(i - m)
  return m + log(z)

or, using \(\log_2\),

function logsum(x)
  m = max(x)
  z = 0
  for i in x
    z += 2^(i - m)
  return m + log2(z)

Forward algorithm in log units

Initialisation \(i = 0\): \(F_0(0) = 0\), \(F_k(0) = -\infty\) for \(k>0\).

Recursion \(i = 1,\ldots,L\): \(F_l(i) = E_l(x_i) + \mathrm{logsum}_k\left(F_k(i-1) + A_{kl}\right)\).

Termination: \(\log P(x) = \mathrm{logsum}_k \left(F_k(L) + A_{k0} \right)\).

If the end state \(0\) is not modelled, set \(A_{k0} = 0\) (i.e. \(a_{k0} = 1\)) to get \[\log P(x) = \mathrm{logsum}_k (F_k(L)).\]
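The transcript below calls a function `forward(seq,a,e)` whose body is not shown; a possible Python equivalent of the log-unit recursion, in base 2 to match the notes (a sketch with my own names, not the notes' actual R function):

```python
from math import log2

# CpG model from the notes
states = "HL"
init = {"H": 0.5, "L": 0.5}
trans = {"HH": 0.5, "HL": 0.5, "LH": 0.4, "LL": 0.6}
emit = {"H": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "L": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def logsum(xs):
    # log2 of a sum of base-2-logged numbers, shifted by the max for stability
    m = max(xs)
    return m + log2(sum(2.0 ** (x - m) for x in xs))

def forward_log2(seq):
    F = {k: log2(init[k]) + log2(emit[k][seq[0]]) for k in states}
    for x in seq[1:]:
        F = {l: log2(emit[l][x]) +
                logsum([F[k] + log2(trans[k + l]) for k in states])
             for l in states}
    return logsum(list(F.values()))   # log2 P(x); end state not modelled

print(round(forward_log2("GGCACTGAA"), 5))   # -17.90778
```

This reproduces the log probability in the transcript below without ever forming a number smaller than about \(2^{-20}\).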

Forward algorithm example: CpG islands cont.

x = GGCACTGAA
a ## transition matrix
##     H   L
## H 0.5 0.5
## L 0.4 0.6
e ## emission matrix
##      A    C    G    T
## H 0.15 0.35 0.35 0.15
## L 0.30 0.20 0.20 0.30

## do calculation without using log units
forwardMatrixNoLog = forwardNoLog(seq,a,e)
forwardMatrixNoLog
##       [,1]     [,2]       [,3]        [,4]         [,5]         [,6]
## [1,] 0.175 0.044625 0.01193937 0.001375603 0.0006931204 8.350342e-05
## [2,] 0.100 0.029500 0.00800250 0.003231356 0.0005253231 1.985262e-04
##              [,7]         [,8]         [,9]
## [1,] 4.240677e-05 5.110917e-06 1.112453e-06
## [2,] 3.217349e-05 1.215224e-05 2.954041e-06
## probability is sum of last column
sum(forwardMatrixNoLog[,9])
## [1] 4.066495e-06

## use log base 2
forwardMatrix = forward(seq,a,e)
forwardMatrix
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] -2.514573 -4.486004 -6.388129 -9.505720 -10.49461 -13.54781 -14.52535
## [2,] -3.321928 -5.083141 -6.965334 -8.273644 -10.89451 -12.29838 -14.92377
##           [,8]      [,9]
## [1,] -17.57799 -19.77782
## [2,] -16.32842 -18.36888
## log probability is logsum of last column
logsum(forwardMatrix[,9])
## [1] -17.90778
log2(sum(forwardMatrixNoLog[,9])) ## compare calculations
## [1] -17.90778

Compare to the joint probability \(\log_2(P(x, \pi^*))\) of the observations and best path \(\pi^*\) from the Viterbi algorithm:

viterbi(seq,a,e)[[1]]
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] -2.514573 -5.029146 -7.543720 -11.28069 -13.11719 -16.85415 -18.65001
## [2,] -3.321928 -5.836501 -8.351074 -10.28069 -13.33958 -15.81351 -18.87240
##           [,8]      [,9]
## [1,] -22.38698 -25.40523
## [2,] -21.34633 -23.82027
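Comparing the two terminations, \(\log_2 P(x) \approx -17.91\) from the forward algorithm versus \(\log_2 P(x,\pi^*) \approx -23.82\) from Viterbi, the single best path carries only a small fraction of the total probability. A quick check of that arithmetic (the values are taken from the matrices above):

```python
# fraction of P(x) carried by the single best path pi*
log2_px = -17.90778          # forward termination, log2 units
log2_px_pistar = -23.82027   # Viterbi termination, log2 units
ratio = 2.0 ** (log2_px_pistar - log2_px)
print(round(ratio, 4))       # 0.0166
```

So \(\pi^*\) accounts for under 2% of \(P(x)\) here, which is why marginalising over paths matters.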