What is a Markov Chain?
A Markov chain is a discrete random process with the property that the next state depends only on the current state (Wikipedia).
Formally, $P(X_n | X_1, X_2, \dots, X_{n-1}) = P(X_n | X_{n-1})$. A classic example is modelling the weather: you assume that today's weather can be predicted using only what you know about yesterday's.
Let's say the states are Rainy and Sunny. When it is rainy on one day, the next day is sunny with probability $0.3$. When it is sunny, the probability of rain the next day is $0.4$.
Now, if today is sunny, we can predict the weather for the day after tomorrow: take the probability of rain tomorrow times the probability of sun after rain, plus the probability of sun tomorrow times the probability of sun after sun. In total, the probability of sun the day after tomorrow is $P(R|S) \cdot P(S|R) + P(S|S) \cdot P(S|S) = 0.4 \cdot 0.3 + 0.6 \cdot 0.6 = 0.48$.
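The arithmetic above can be checked with a tiny script. The state encoding (0 = Sunny, 1 = Rainy) and the matrix layout are just one possible way to write it down:

```python
# Two-state weather chain: 0 = Sunny, 1 = Rainy.
# From the text: P(R|S) = 0.4, P(S|R) = 0.3, so the remaining
# entries in each row are forced, since rows must sum to 1.
P = [[0.6, 0.4],   # from Sunny: P(S|S) = 0.6, P(R|S) = 0.4
     [0.3, 0.7]]   # from Rainy: P(S|R) = 0.3, P(R|R) = 0.7

# Probability of Sunny two days after a Sunny day:
# sum over tomorrow's weather k of P(k|Sunny) * P(Sunny|k)
p_sunny_in_two_days = sum(P[0][k] * P[k][0] for k in range(2))
print(round(p_sunny_in_two_days, 2))  # 0.48
```

The sum over the intermediate state is exactly the entry $(S, S)$ of the squared transition matrix, which is why two-step predictions are just matrix multiplication.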
In a nutshell, a Markov chain is (the behaviour of) a random process that can find itself in one of a (not necessarily finite) number of states. The process moves from one state to another at discrete times (that is, you get a sequence $S(t)$ of states at times $t = 0, 1, 2, \dots$), and the probability of going from state $S$ to state $R$ depends only on $S$ and $R$: there is no "memory of the past", and the transition probabilities do not change over time. This means that a Markov chain with $n$ states can be modelled as an $n \times n$ matrix of transition probabilities.
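As a minimal sketch of that matrix view, here is a chain stored as a row-stochastic matrix together with a sampling step (the `step` helper is illustrative, not a standard library function):

```python
import random

def step(P, state, rng=random):
    """Sample the next state given only the current one (no memory of the past)."""
    r = rng.random()
    cumulative = 0.0
    for j, p in enumerate(P[state]):
        cumulative += p
        if r < cumulative:
            return j
    return len(P[state]) - 1  # guard against floating-point round-off

# Walk the two-state weather chain for a few steps (0 = Sunny, 1 = Rainy).
P = [[0.6, 0.4], [0.3, 0.7]]
state = 0
trajectory = [state]
for _ in range(10):
    state = step(P, state)
    trajectory.append(state)
print(trajectory)
```

Note that `step` only ever looks at `P[state]`: the whole history in `trajectory` is irrelevant to the next draw, which is the Markov property in code.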
An example of a process that can be modelled by a Markov chain is the sequence of faces of a die showing up, if at each step you are allowed to tip the die over one of its edges. The corresponding matrix is
        1    2    3    4    5    6
    ------------------------------
1 |     0  1/4  1/4  1/4  1/4    0
2 |   1/4    0  1/4  1/4    0  1/4
3 |   1/4  1/4    0    0  1/4  1/4
4 |   1/4  1/4    0    0  1/4  1/4
5 |   1/4    0  1/4  1/4    0  1/4
6 |     0  1/4  1/4  1/4  1/4    0
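This matrix can be checked in code: each row sums to 1, and iterating the chain drives any starting distribution towards the uniform one, so in the long run every face is equally likely. A sketch (the `evolve` helper is illustrative):

```python
# The matrix above in code: P[i][j] is the probability that face j+1
# shows after tipping the die when face i+1 is showing.
q = 1 / 4
P = [
    [0, q, q, q, q, 0],  # from 1: cannot stay on 1 or reach the opposite face 6
    [q, 0, q, q, 0, q],  # from 2: opposite face is 5
    [q, q, 0, 0, q, q],  # from 3: opposite face is 4
    [q, q, 0, 0, q, q],  # from 4: opposite face is 3
    [q, 0, q, q, 0, q],  # from 5: opposite face is 2
    [0, q, q, q, q, 0],  # from 6: opposite face is 1
]
assert all(abs(sum(row) - 1) < 1e-12 for row in P)  # rows are distributions

def evolve(dist, P):
    """One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Start with face 1 showing for certain; after many tips every face is
# (very nearly) equally likely.
dist = [1.0, 0, 0, 0, 0, 0]
for _ in range(50):
    dist = evolve(dist, P)
print([round(p, 4) for p in dist])  # every entry rounds to 0.1667
```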
As usual, Wikipedia and MathWorld are your friends.
Markov chains, and especially hidden Markov models, are hugely important in computational linguistics. A hidden Markov model is one where we can't directly observe the state, but we do have some information about what the state might be. For example, consider breaking a sentence down into what are called "parts of speech", such as verbs, adjectives, etc. We don't know the parts of speech directly, but we can attempt to deduce them from the words. For example, the word "run" might be used 80% of the time as a verb, 18% of the time as a noun and 2% of the time as an adjective. We also have (Markov) relations between the parts of speech, so for example an adjective might be followed by a noun 70% of the time and by another adjective 30% of the time.
We can use the Viterbi algorithm to decide which sequence of parts of speech is most likely to have generated the observed sentence. This algorithm takes into account two factors:
- the probability of such a sequence of parts of speech occurring together in a sentence
- the relative chance that such a sequence of parts of speech would be responsible for us observing the given sentence.
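Both factors appear in the sketch below: the transition probabilities cover the first, the emission probabilities the second. All tags, words and numbers here are made up for illustration, not taken from any real corpus:

```python
states = ["NOUN", "VERB", "ADJ"]
start_p = {"NOUN": 0.4, "VERB": 0.3, "ADJ": 0.3}   # P(first tag)
trans_p = {                                         # P(next tag | tag)
    "NOUN": {"NOUN": 0.2, "VERB": 0.6, "ADJ": 0.2},
    "VERB": {"NOUN": 0.5, "VERB": 0.1, "ADJ": 0.4},
    "ADJ":  {"NOUN": 0.7, "VERB": 0.0, "ADJ": 0.3},
}
emit_p = {                                          # P(word | tag)
    "NOUN": {"run": 0.18, "fast": 0.02, "dogs": 0.80},
    "VERB": {"run": 0.80, "fast": 0.00, "dogs": 0.20},
    "ADJ":  {"run": 0.02, "fast": 0.90, "dogs": 0.08},
}

def viterbi(obs):
    """Most likely tag sequence for the observed words."""
    # V[t][s] is the probability of the best tag sequence ending in
    # state s at position t; back[t][s] remembers the previous state.
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r] * trans_p[r][s] * emit_p[s].get(obs[t], 0.0), r)
                for r in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

print(viterbi(["dogs", "run", "fast"]))  # ['NOUN', 'VERB', 'ADJ']
```

Real taggers work with log-probabilities to avoid underflow on long sentences, but for a three-word example plain multiplication is fine.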