19 May 2017

Distance-based methods

  • Usually start from a set of sequences
  • Define a distance between sequences
  • Calculate distances between each pair of sequences
  • Build tree from resulting distance matrix
  • Already seen UPGMA. Will also look at neighbour-joining
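The first three steps above can be sketched in a few lines of Python. This is only an illustration: the names `p_distance` and `distance_matrix` and the toy alignment are made up for this example, and the mismatch fraction stands in for whichever distance is actually chosen.

```python
from itertools import combinations

def p_distance(x, y):
    """Fraction of aligned sites at which x and y differ (placeholder distance)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def distance_matrix(seqs, dist=p_distance):
    """Return a symmetric n x n list of lists of pairwise distances."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    # Each unordered pair is computed once and mirrored across the diagonal.
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = dist(seqs[i], seqs[j])
    return D

# Hypothetical toy alignment, not real data.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGAACGA"]
D = distance_matrix(seqs)
for row in D:
    print("  ".join(f"{v:.3f}" for v in row))
```

The resulting matrix `D` is what a tree-building method such as UPGMA or neighbour-joining would take as input.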

Defining distances: Hamming distance

The Hamming distance for an aligned pair of sequences \(x\), \(y\) is \[ D_{xy} = \mbox{ number of places } x \mbox{ and } y \mbox{ differ.} \]

Normalise this by dividing by sequence length \(L\) to get \[ f_{xy} = \frac{ D_{xy}} L = \mbox{ fraction of sites at which } x \mbox{ and } y \mbox{ differ.}\]
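As a small worked example, the two quantities can be computed as follows. The function names and the two sequences are invented for illustration, and the sketch assumes the sequences are already aligned and of equal length.

```python
def hamming(x, y):
    """D_xy: number of aligned positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y))

def fraction_differing(x, y):
    """f_xy = D_xy / L, where L is the common sequence length."""
    return hamming(x, y) / len(x)

# Hypothetical aligned pair; they differ at positions 4 and 9 (1-based).
x = "ACGTACGTAC"
y = "ACGAACGTTC"
print(hamming(x, y))             # 2
print(fraction_differing(x, y))  # 0.2
```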

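\(f\) captures evolutionary distance well for small distances, but underestimates larger ones: repeated substitutions at the same site are counted at most once, so \(f\) levels off as the sequences diverge.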

Jukes-Cantor distance

Based on a model in which mutations between all four bases are equally likely.

Corrects for the fact that even unrelated sequences will agree at some sites (about 1 in 4 under this model) simply by chance.

\[d_{xy} = -\frac34 \log \left (1- \frac{4f_{xy}}3\right).\]
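A minimal sketch of this correction in Python, taking \(\log\) to be the natural logarithm. The function names are illustrative. The helper `expected_fraction` is not from these notes: it uses the standard Jukes-Cantor relation \(f = \frac34\left(1 - e^{-4d/3}\right)\), of which the formula above is the inverse.

```python
import math

def jukes_cantor(f):
    """Jukes-Cantor distance from the fraction f of differing sites."""
    # The logarithm is only defined while 1 - 4f/3 > 0, i.e. f < 3/4.
    if not 0.0 <= f < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for f >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)

def expected_fraction(d):
    """Expected fraction of differing sites at Jukes-Cantor distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Compare the raw fraction with the corrected distance.
for f in (0.01, 0.1, 0.3, 0.5, 0.7):
    print(f"f = {f:.2f}   d = {jukes_cantor(f):.3f}")
```

For small \(f\) the corrected distance is close to \(f\) itself, while as \(f\) approaches \(3/4\) it grows without bound, which is the behaviour the raw fraction lacks.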

Can use other mutation models to define other distances.