Markov models

In a first-order Markov model you estimate the letter transition probabilities, P(b|a) and so on, from frequency counts. For example, counting every adjacent pair and every single letter in abbcab gives (a code sketch of this step follows the list):

  • C(ab)=2
  • C(bb)=1
  • C(bc)=1
  • C(ca)=1
  • C(a)=2
  • C(b)=3
  • C(c)=1
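
As a rough sketch of that counting step in Java (the class name MarkovCounts and the map-based representation are illustrative assumptions, not part of the original exercise):

  import java.util.HashMap;
  import java.util.Map;

  public class MarkovCounts {
      public static void main(String[] args) {
          String text = "abbcab";
          Map<Character, Integer> letterCounts = new HashMap<>();
          Map<String, Integer> pairCounts = new HashMap<>();

          // Single-letter counts: C(a)=2, C(b)=3, C(c)=1
          for (char c : text.toCharArray()) {
              letterCounts.merge(c, 1, Integer::sum);
          }
          // Adjacent-pair counts: C(ab)=2, C(bb)=1, C(bc)=1, C(ca)=1
          for (int i = 0; i + 1 < text.length(); i++) {
              pairCounts.merge(text.substring(i, i + 2), 1, Integer::sum);
          }

          System.out.println(letterCounts); // {a=2, b=3, c=1} (order may vary)
          System.out.println(pairCounts);   // {ab=2, bb=1, bc=1, ca=1} (order may vary)
      }
  }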

Then compute the probabilities. Each conditional is its pair count divided by the number of pairs that start with the conditioning letter (a starts two pairs, b starts two, c starts one), and each single-letter probability is that letter's count divided by the total number of letters:

  • P(b|a)=C(ab)/2=1
  • P(b|b)=C(bb)/2=1/2
  • P(c|b)=C(bc)/2=1/2
  • P(a|c)=C(ca)/1=1
  • P(any other combination)=0
  • P(a)=2/(2+3+1)
  • P(b)=3/(2+3+1)
  • P(c)=1/(2+3+1)
  • P(d)..P(z)=0
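
A conditional can then be read off the pair counts. Here is a minimal sketch, written as a method for the same illustrative class as above (the name conditional is an assumption, not part of the original exercise):

  // P(second|first): the pair count divided by the total count of
  // pairs whose first letter is `first`; 0 if that letter never
  // starts a pair.
  static double conditional(Map<String, Integer> pairCounts,
                            char first, char second) {
      int pair = pairCounts.getOrDefault("" + first + second, 0);
      int startingWithFirst = 0;
      for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
          if (e.getKey().charAt(0) == first) {
              startingWithFirst += e.getValue();
          }
      }
      return startingWithFirst == 0 ? 0.0 : (double) pair / startingWithFirst;
  }

For the abbcab counts, conditional(pairCounts, 'a', 'b') returns 1.0 and conditional(pairCounts, 'b', 'b') returns 0.5.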

Then, for some sequence of letters x1, ..., xn (the code), the likelihood that the code came from the same language as our sample abbcab is

P(x1) P(x2|x1) ... P(xn|xn-1),

a product of probabilities, which follows from the first-order Markov assumption: each letter depends only on the letter before it. The first term, P(x1), is why you need the single-letter frequencies as well.

Note that you should not multiply the probabilities directly. First take logs (using Java doubles), compute the sum of the log-probabilities, and only at the end raise the result back as a power of the base of the logarithm if you need the raw likelihood; otherwise you will accumulate too many numerical errors. You also need to define log 0 = 0: in the coding logic, if a probability is zero, add zero, otherwise compute the log and add that.
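
A minimal sketch of that scoring loop, assuming the single-letter and transition probabilities have already been stored in maps (the method and map names are illustrative):

  // Log-likelihood of `code` under the first-order model, with the
  // convention log 0 = 0: a zero probability adds nothing to the sum
  // instead of sending it to minus infinity.
  static double logLikelihood(String code,
                              Map<Character, Double> single,
                              Map<String, Double> transition) {
      double p = single.getOrDefault(code.charAt(0), 0.0); // first term: P(x1)
      double sum = (p == 0.0) ? 0.0 : Math.log(p);
      for (int i = 1; i < code.length(); i++) {
          double t = transition.getOrDefault(code.substring(i - 1, i + 1), 0.0);
          sum += (t == 0.0) ? 0.0 : Math.log(t); // P(x_i|x_{i-1})
      }
      return sum;
  }

Comparing the sums directly also avoids the final exponentiation when you only need a ranking of candidate languages.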

Doing this for, say, fifty separate samples of a text gives you a confidence level on your final p-value, but splitting the text up that far makes the individual probability estimates less reliable; fifty samples may be too many for short texts.

You can extend this to higher-order Markov models; e.g. for a second-order model you estimate P(b|bc) from counts of the trigram bcb, and so on. A sketch of the second-order counting step follows.
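
The counting step generalises directly; a sketch, again as a method for the same illustrative class (trigramCounts is an assumed name):

  // Second-order counting step: trigram counts, from which
  // P(b|bc) = C("bcb") / (total count of trigrams starting with "bc").
  static Map<String, Integer> trigramCounts(String text) {
      Map<String, Integer> counts = new HashMap<>();
      for (int i = 0; i + 2 < text.length(); i++) {
          counts.merge(text.substring(i, i + 3), 1, Integer::sum);
      }
      return counts;
  }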