30 May 2016

Likelihood on trees

Data is \(D\), an \(n\times L\) matrix

\(D_{ij}\) is the value of the \(j\)th residue in the \(i\)th sequence

Want to calculate \(\mathcal L(T) = P(D|T,\mu)\), where \(T\) is a binary tree with \(n\) leaves and a sequence (a row of \(D\)) associated with each leaf, and \(\mu\) are the parameters of the mutation model.

\(P(D|T) = \prod_{i= 1}^L P(D_{:,i}|T)\) since sites are independent

To find \(P(D_{:,i}|T)\) we need the unknown ancestral values, \(A_{:,i}\), at the internal nodes of the tree

We don't know \(A_{:,i}\), so sum over all possible values: \(P(D_{:,i}|T) = \sum_{A_{:,i}} P(D_{:,i}, A_{:,i} |T)\)

This sum can be computed efficiently using dynamic programming (Felsenstein's pruning algorithm; see Felsenstein 1973, 1981)
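To make the sum concrete, here is a minimal R sketch for the simplest case: one site, two leaves joined at a single ancestral node. The rate matrix, branch lengths, observed bases and uniform root distribution are all illustrative assumptions, this is not the full pruning algorithm, and the transition probabilities \(P(t) = \exp(Qt)\) are only developed later in these notes.

library(Matrix)   # provides expm(), used throughout these notes

Qtoy <- matrix(1, 4, 4); diag(Qtoy) <- -3   # toy rate matrix: all off-diagonal rates 1
pi0  <- rep(0.25, 4)                        # assumed distribution of the ancestral base
P1   <- as.matrix(expm(Qtoy * 0.1))         # P(t) for the branch to leaf 1 (length 0.1)
P2   <- as.matrix(expm(Qtoy * 0.3))         # P(t) for the branch to leaf 2 (length 0.3)
x1 <- 1; x2 <- 3                            # observed bases at the leaves (1 = A, 3 = G)

# P(D_site | T) = sum over ancestral base a of pi0[a] * P1[a, x1] * P2[a, x2]
lik <- sum(pi0 * P1[, x1] * P2[, x2])
lik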

Likelihood on trees

So \(P(D_{:,i}, A_{:,i} |T)\) is the likelihood of the tree at a site where the values are known at every node of the tree.

Assume independence across lineages; then the above becomes a product of the likelihoods of the individual lineages. We only need a model that gives the probability of seeing base \(a\) at the top of a lineage and base \(b\) at the bottom, given the length of the lineage.

Use a Markov process to model this

Markov processes

In a Markov process, time is continuous. It is sometimes called a continuous-time Markov chain.

Basic idea:

  • Finite number of states. (extends to countable number of states)
  • System occupies a state at any given time.
  • Amount of time spent in a state is exponentially distributed with some parameter.
  • After that time, it "jumps" to another state.
  • A Markov process is a collection of random variables \(\{ X (t) \}\), where \(X(t)\) is the state occupied at time \(t\) and \(t\) is continuous.
  • Given the current state, the future of the process is independent of its past (the Markov property).

Poisson process: example of a Markov process

  • States of Poisson process are all natural numbers.
  • Time spent in each state is exponential with rate \(\lambda\); each jump moves the process from \(n\) to \(n+1\) (see the sketch below).
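A minimal R sketch of simulating a Poisson process, assuming we only need the jump times: the waiting times are independent \(\mathrm{Exp}(\lambda)\) draws, so the jump times are their cumulative sums (the rate and number of jumps are illustrative values).

lambda <- 2                         # illustrative rate
waits  <- rexp(50, rate = lambda)   # 50 independent exponential waiting times
jumps  <- cumsum(waits)             # jump times; after the k-th jump the count has increased by k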

Markov process for substitution models

Model the value of a site of a DNA sequence by a Markov process, \(X(t)\)

Possible states: \(X(t) \in \{A,C,G,T \}\)

In a general Markov process, the time between jumps depends on the state we are in (in contrast with the Poisson process, where every waiting time has the same rate \(\lambda\))

E.g., time spent in \(A\) may be different from time spent in \(C\).

Time spent in states

All times between jumps are exponentially distributed.

Let \(q_{ab}\) be the rate of transitions from state \(a\) to state \(b\), for any states \(a \neq b\).

Total rate of transitions from \(a\) is \[q_{aa} = \sum_{b \neq a} q_{ab}.\]

So the length of time spent in state \(a\) is exponentially distributed with rate \(q_{aa}\)

At a jump, move into state \(b\) with probability \[\frac{q_{ab}}{\sum_{k \neq a} q_{ak}} = \frac{q_{ab}}{ q_{aa}}\]
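For example, with the \(4\times 4\) rate matrix used below, the process leaves state \(A\) at total rate \(q_{AA} = 2 + 5 + 1 = 8\), so it stays in \(A\) for an \(\mathrm{Exp}(8)\)-distributed time and then jumps to \(C\), \(G\) or \(T\) with probabilities \(2/8\), \(5/8\) and \(1/8\).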

Rate matrices, \(Q\)

A matrix, usually denoted \(Q\), shows the rates of transitions between states.

Off-diagonal elements are \(q_{ab}\), \(a \neq b\), the rate of moving from \(a\) to \(b\).

Diagonal elements are \(-q_{aa}\), the negative of the total rate out of \(a\), where \(q_{aa} = \sum_{b \neq a} q_{ab}\).

In the case of modelling DNA mutations, \(Q\) has the form \[ Q = \renewcommand{\arraystretch}{1.25} \left( \begin{array}{cccc} -\sum_{j \neq A}q_{Aj} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & -\sum_{j \neq C}q_{Cj} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & -\sum_{j \neq G}q_{Gj} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & -\sum_{j \neq T}q_{Tj} \end{array} \right).\]

Each row of \(Q\) sums to 0, by construction.

Example of a Markov process

Consider the rate matrix

##    A  C  G  T
## A -8  2  5  1
## C  2 -7  2  3
## G  5  2 -9  2
## T  1  3  2 -6

or using 1,2,3,4 for the DNA bases:

##    1  2  3  4
## 1 -8  2  5  1
## 2  2 -7  2  3
## 3  5  2 -9  2
## 4  1  3  2 -6
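The code that constructs this matrix is not shown in the notes; one way to build it in R (an illustrative sketch, with the bases as row and column names) is:

bases <- c("A", "C", "G", "T")
Q <- matrix(c(-8,  2,  5,  1,
               2, -7,  2,  3,
               5,  2, -9,  2,
               1,  3,  2, -6),
            nrow = 4, byrow = TRUE, dimnames = list(bases, bases))
rowSums(Q)   # each row sums to 0, as required for a rate matrix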

So let's simulate the process for time \(T = 10\):

mp = simMarkovPro(Q,T = 10)
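The function simMarkovPro() is not shown in these notes; a minimal sketch of a simulator with roughly this interface (the name, arguments and return format are assumptions) might be:

simMarkovPro <- function(Q, T, start = sample(nrow(Q), 1)) {
  times  <- 0       # jump times, starting at time 0
  states <- start   # states visited, starting from a random state
  t <- 0; a <- start
  repeat {
    t <- t + rexp(1, rate = -Q[a, a])   # holding time in state a has rate q_aa = -Q[a, a]
    if (t > T) break                    # stop once we pass the end time
    jump <- Q[a, ]; jump[a] <- 0        # jump to b with probability q_ab / q_aa
    a <- sample(nrow(Q), 1, prob = jump)
    times  <- c(times, t)
    states <- c(states, a)
  }
  list(times = times, states = states)
}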

Print the first 50 jump times:

##  [1] 0.000 0.037 0.221 0.302 0.528 0.698 0.836 0.839 0.945 1.258 1.371
## [12] 1.733 1.883 1.888 1.993 2.005 2.051 2.198 2.434 2.605 2.860 2.903
## [23] 3.157 3.207 3.348 3.474 3.737 3.762 3.875 4.174 4.736 4.806 4.837
## [34] 5.046 5.354 5.379 5.386 5.444 5.695 5.734 5.856 5.908 5.956 5.960
## [45] 6.012 6.120 6.316 6.458 6.610 6.843

And first 50 states visited:

##  [1] 4 2 1 3 1 2 4 1 2 4 2 1 4 2 4 2 1 2 3 1 3 2 3 1 4 3 1 4 2 4 1 2 4 2 3
## [36] 4 2 1 3 1 3 2 3 1 2 4 3 4 3 4

This is easier to understand visually, so plot it:

Just the first 10 jumps:

From rates to probabilities

Suppose we are in some state now. What is the probability of being in this or any other state at some later time point?

Define \(P_{ab}(t) = \Pr(X(t) = b|X(0) = a)\), the probability of being in state \(b\) at time \(t\) given that the process started in state \(a\) at time 0.

Assume the rates stay the same throughout the process (time homogeneity), so \(P_{ab}(t) = \Pr(X(s+t) = b | X(s) = a)\) for any \(s\)

It turns out that we can derive \(P(t)\) from \(Q\) by taking the matrix exponential: \[ P(t) = \exp(Q t) = \sum_{k = 0}^{\infty} \frac{ (Q t)^k}{k!} = \mathbf I + tQ + \frac{(tQ)^2}{2} + \frac{(tQ)^3}{6} + \ldots\]
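As a quick check of this definition, we can truncate the series after a fixed number of terms and compare with the expm() function from the Matrix package used below. The number of terms, and the reuse of the 4-state example \(Q\) from earlier, are illustrative assumptions; summing the series directly is only sensible when \(Qt\) is small.

series_P <- function(Q, t, kmax = 25) {
  P    <- diag(nrow(Q))              # k = 0 term: the identity
  term <- diag(nrow(Q))
  for (k in 1:kmax) {
    term <- term %*% (Q * t) / k     # builds up (Qt)^k / k!
    P <- P + term
  }
  P
}
round(series_P(Q, 0.1), 4)           # using the 4-state example Q from earlier
round(expm(Q * 0.1), 4)              # essentially the same answer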

Matrix exponential

It can sometimes be calculated analytically

For general \(Q\), use numerical methods from numerical libraries.

The matrix exponential follows the rules we would expect of an exponential function. For example:

  • \(\exp(\mathbf 0) = \mathbf I_n\)
  • When \(\mathbf{AB} = \mathbf{BA}\), \(\exp(\mathbf{A+B}) =\exp(\mathbf{A})\exp(\mathbf{B})\)
  • When \(\mathbf B\) is invertible, \(\exp(\mathbf{BAB}^{-1}) = \mathbf{B}\exp(\mathbf{A})\mathbf B^{-1}\)
  • If \(\mathbf A = \mathrm{diag}(a_1,\ldots,a_n)\), then \(\exp(\mathbf A) = \mathrm{diag}(\exp(a_1),\ldots,\exp(a_n))\) (these last two properties are used in the sketch below).
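The last two properties suggest one common way to compute \(\exp(Qt)\) when \(Q\) is diagonalisable: write \(Q = B\,\mathrm{diag}(d_1,\ldots,d_n)\,B^{-1}\), so that \(\exp(Qt) = B\,\mathrm{diag}(e^{d_1 t},\ldots,e^{d_n t})\,B^{-1}\). A rough R sketch (not necessarily what expm() does internally):

expm_eig <- function(Q, t) {
  e <- eigen(Q)                                    # assumes Q is diagonalisable
  B <- e$vectors
  Re(B %*% diag(exp(e$values * t)) %*% solve(B))   # Re() drops tiny imaginary parts
}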

Matrix exponential example

Suppose there are two states, 1 and 2.

Let the rate matrix be

\[Q = \renewcommand{\arraystretch}{1.25} \left( \begin{array}{cc} -1 & 1 \\ 0.6 & -0.6 \end{array} \right).\]

To get the probability matrix, use the matrix exponential:

\[ P(1) = \exp(1 \cdot Q) = \mathbf I + Q + \frac{Q^2}{2} + \frac{Q^3}{6} + \ldots = \renewcommand{\arraystretch}{1.25} \left( \begin{array}{cc} 0.501 & 0.499 \\ 0.299 & 0.701 \end{array} \right) \]

So if we started in state 2 at time 0, at time 1 there is a 29.9% chance of being in state 1 and a 70.1% chance of being in state 2.

By time 2, the preference for state 2 is obvious: \[ P(2) = \exp(2Q) = \renewcommand{\arraystretch}{1.25} \left( \begin{array}{cc} 0.400 & 0.600 \\ 0.360 & 0.640 \end{array} \right) \]

Remember, we can calculate \(P(t)\) for any value of \(t\), not just integers:

\[ P(0.234) = \exp(0.234 \times Q) = \renewcommand{\arraystretch}{1.25} \left( \begin{array}{cc} 0.805 & 0.195 \\ 0.117 & 0.883 \end{array} \right) \]
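These matrices can be reproduced numerically with the expm() function from the Matrix package (the same function used in the larger example below); the snippet is just an illustrative check, with Q2 a local name for the two-state rate matrix above.

library(Matrix)
Q2 <- matrix(c(-1, 1, 0.6, -0.6), nrow = 2, byrow = TRUE)
round(expm(Q2 * 1), 3)       # P(1)
round(expm(Q2 * 2), 3)       # P(2)
round(expm(Q2 * 0.234), 3)   # P(0.234)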

If we ran the process for an infinitely long time, we would get the same probabilities regardless of the starting state:

\[ P(\infty ) =\renewcommand{\arraystretch}{1.25} \left( \begin{array}{cc} 0.375 & 0.625 \\ 0.375 & 0.625 \end{array} \right) \]

Each row is a copy of the equilibrium distribution or stationary distribution for the process.

Want this property of stationarity for models of DNA substitution so that

\[ P(\infty ) =\renewcommand{\arraystretch}{1.25} \left( \begin{array}{cccc} \pi_A & \pi_C & \pi_G & \pi_T \\ \pi_A & \pi_C & \pi_G & \pi_T \\ \pi_A & \pi_C & \pi_G & \pi_T \\ \pi_A & \pi_C & \pi_G & \pi_T \end{array} \right) \]

\(\pi = [\pi_A , \pi_C , \pi_G , \pi_T]\) is the stationary distribution of the bases.
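The stationary distribution can be computed directly from \(Q\): it is the probability vector \(\pi\) satisfying \(\pi Q = 0\), i.e. a left eigenvector of \(Q\) for eigenvalue 0, rescaled to sum to 1. A rough R sketch (the function name and normalisation step are my own):

stationary <- function(Q) {
  e <- eigen(t(Q))                                 # left eigenvectors of Q
  v <- Re(e$vectors[, which.min(abs(e$values))])   # eigenvector for the eigenvalue closest to 0
  v / sum(v)                                       # normalise to a probability vector
}
stationary(matrix(c(-1, 1, 0.6, -0.6), nrow = 2, byrow = TRUE))   # 0.375 0.625, as in P(infinity) above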

Matrix exponential example 2

Q 
##    1  2  3  4
## 1 -8  2  5  1
## 2  2 -7  2  3
## 3  5  2 -9  2
## 4  1  3  2 -6
# P(0)
expm(Q*0)
## 4 x 4 diagonal matrix of class "ddiMatrix"
##   1 2 3 4
## 1 1 . . .
## 2 . 1 . .
## 3 . . 1 .
## 4 . . . 1

# P(0.01)
round(expm(Q*0.01),2)
## 4 x 4 Matrix of class "dsyMatrix"
##      1    2    3    4
## 1 0.92 0.02 0.05 0.01
## 2 0.02 0.93 0.02 0.03
## 3 0.05 0.02 0.92 0.02
## 4 0.01 0.03 0.02 0.94
# P(0.1)
round(expm(Q*0.1),2)
## 4 x 4 Matrix of class "dsyMatrix"
##      1    2    3    4
## 1 0.52 0.14 0.24 0.10
## 2 0.14 0.55 0.14 0.18
## 3 0.24 0.14 0.49 0.13
## 4 0.10 0.18 0.13 0.59

# P(1)
round(expm(Q*1),4)
## 4 x 4 Matrix of class "dsyMatrix"
##        1      2      3      4
## 1 0.2504 0.2498 0.2503 0.2495
## 2 0.2498 0.2501 0.2499 0.2502
## 3 0.2503 0.2499 0.2502 0.2497
## 4 0.2495 0.2502 0.2497 0.2506
# P(100)
expm(Q*100)
## 4 x 4 Matrix of class "dsyMatrix"
##      1    2    3    4
## 1 0.25 0.25 0.25 0.25
## 2 0.25 0.25 0.25 0.25
## 3 0.25 0.25 0.25 0.25
## 4 0.25 0.25 0.25 0.25
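Note that this example \(Q\) is symmetric, so its stationary distribution is uniform: by \(t = 100\) every row of \(P(t)\) has converged to \(\pi = (0.25, 0.25, 0.25, 0.25)\), regardless of the starting base.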