CS 369: Tree likelihood and Substitution models

1 June 2017

Last time: Markov process for substitution models

Model value of a site of DNA sequence by a Markov process, \(X(t)\)

Possible states: \(X(t) \in \{A,C,G,T \}\))

General Markov process has time between jumps depends on the state we are in (contrast with Poisson process)

E.g., time spent in \(A\) may be different from time spent in \(C\).

Last time: Time spent in states

All times between jumps are exponentially distributed.

Let \(q_{ab}\) be the rate of transitions from state \(a\) to state \(b\), for any states \(a \neq b\).

Total rate of transitions from \(a\) is \[q_{aa} = \sum_{b \neq a} q_{ab}.\]

So length of time spent in state \(a\) is exponentially distributed with rate \(q_{aa}\)

At a jump, move into state \(b\) with probability \[\frac{q_{ab}}{\sum_{k \neq a} q_{ak}} = \frac{q_{ab}}{ q_{aa}}\]

Rate matrices for DNA substitution

Want a stationary, reversible process. Already seen stationarity

Reversible means the processes looks the same running forward as it does running backwards.

All the models listed below are reversible

Jukes-Cantor model

\[ Q = \begin{bmatrix} -1 & 1/3 & 1/3 & 1/3 \\ 1/3 & -1 & 1/3 & 1/3 \\ 1/3 & 1/3 & -1 & 1/3 \\ 1/3 & 1/3 & 1/3 & -1 \end{bmatrix}. \]

The equilibrium of this process is \(\pi = (1/4,1/4,1/4,1/4)\).

Kimura Model

\[ Q = \begin{bmatrix} -2-\kappa & 1 & \kappa & 1 \\ 1 & -2-\kappa & 1 & \kappa \\ \kappa & 1 & -2-\kappa & 1 \\ 1 & \kappa & 1 & -2-\kappa \end{bmatrix} \]

\(\kappa\) is the transition/transversion ratio: the preference for seeing purine to purine or pyrimidine to pyrimidine mutations over purine to pyrimidine mutations.

The equilibrium of this process is \(\pi = (1/4,1/4,1/4,1/4)\).

HKY Model

\[ Q = \begin{bmatrix} * & \pi_C & \kappa\pi_G & \pi_T \\ \pi_A & * & \pi_G & \kappa\pi_T \\ \kappa\pi_A & \pi_C & * & \pi_T \\ \pi_A & \kappa\pi_C & \pi_G & * \end{bmatrix} \]

Extends Kimura model to have allow arbitrary equilibrium distribution \(\pi = [\pi_A , \pi_C , \pi_G , \pi_T]\).

General time reversible (GTR) model and scaling

\[ \renewcommand{\arraystretch}{1.25} Q = \left[ \begin{array}{cccc} * & a\pi_C & b\pi_G & c\pi_T \\ a \pi_A & * & d\pi_G & e\pi_T \\ b \pi_A & d\pi_C & * & f\pi_T\\ c \pi_A & e\pi_C & f\pi_G & * \end{array} \right].\]

Scale the rate matrix so that get, on average, one substitution per unit time. Given unscaled matrix, \(\hat Q\), can choose \(\beta\) so that \[ Q = \beta \hat Q\] has this property. Time measured in expected substitutions.

Let \(\mu\) be mutation rate parameter to relate this time scale to real time. Given a scaled \(Q\) and \(\mu\): \[P(t) = \exp (\mu t Q).\]

Example: Q matrix

##      A    C    G    T
## A -0.8  0.2  0.5  0.1
## C  0.2 -0.7  0.2  0.3
## G  0.5  0.2 -0.9  0.2
## T  0.1  0.3  0.2 -0.6

Example: Probability transition matrices

\(P(1) = \exp(1Q)\):

##      A    C    G    T
## A 0.52 0.14 0.24 0.10
## C 0.14 0.55 0.14 0.18
## G 0.24 0.14 0.49 0.13
## T 0.10 0.18 0.13 0.59

\(P(2) = \exp(2Q)\):

##      A    C    G    T
## A 0.36 0.20 0.28 0.16
## C 0.20 0.37 0.20 0.24
## G 0.28 0.20 0.33 0.19
## T 0.16 0.24 0.19 0.41

Stationary distribution is a row of \(P(\infty) = \exp(\infty Q)\): \(\pi=\)[0.25, 0.25, 0.25, 0.25]

Finding the best tree: Maximum likelihood

Initialise: choose an initial tree

Iterate: until no improvement occurs

modify the tree and calculate its likelihood
if the modified tree is an improvement, keep it. Else, return to the previous tree

Modifications include topology changes and changes to branch lengths.