Constructing transition probability matrix
You can use the package CrossTabulate.m. (More detailed references are given in this MSE answer.)
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/CrossTabulate.m"]
cmat = CrossTabulate[Partition[x, 2, 1]];
MatrixForm[cmat]
cmat["SparseMatrix"] = cmat["SparseMatrix"]/Total[cmat["SparseMatrix"], {2}];
MatrixForm[cmat]
ArrayRules[cmat["SparseMatrix"]]
(* {{1, 1} -> 5/9, {1, 5} -> 1/9, {1, 4} -> 2/9, {1, 3} -> 1/
9, {2, 5} -> 2/3, {2, 1} -> 1/3, {3, 2} -> 1/5, {3, 1} -> 1/
5, {3, 3} -> 2/5, {3, 4} -> 1/5, {4, 4} -> 3/8, {4, 3} -> 1/
8, {4, 2} -> 1/4, {4, 1} -> 1/8, {4, 5} -> 1/8, {5, 4} -> 2/
5, {5, 5} -> 2/5, {5, 3} -> 1/5, {_, _} -> 0} *)
You can use SparseArray
with additive assembly as follows:
x = RandomChoice[Alphabet["English", "IndexCharacters"], 1000000];
data = Flatten[ToCharacterCode[x]] - (ToCharacterCode["A"][[1]] - 1); // AbsoluteTiming // First
A = With[{
spopt = SystemOptions["SparseArrayOptions"]},
Internal`WithLocalSettings[
(*switch to additive assembly*)
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> Total}],
(*assemble matrix*)
SparseArray[
Partition[data, 2, 1] -> 1,
Max[data] {1, 1}
]
,
(*reset "SparseArrayOptions" to previous value*)
SetSystemOptions[spopt]]
]; // AbsoluteTiming // First
0.739454
0.114682
Remarks
As the timings suggest, it is worthwhile to avoid strings in the first place.
Formerly, I used
LetterNumber
, butToCharacterCode
is much, much faster.It is
"TreatRepeatedEntries" -> Total
which enables summing of entries.Count
is not needed anymore.Developer`ToPackedArray
might speed up things a bit ifx
is very long. The other hokus-pokus is for making things bulletproof against aborts (i.e., options are reset even if computations are interrupted). See also (37566) and (136017).
You can also use EstimatedProcess
and MarkovProcessProperties
as follows:
states = DeleteDuplicates[x];
ordering = Ordering[states];
data = ArrayComponents @ x ;
estproc = EstimatedProcess[data, DiscreteMarkovProcess[Length@states]];
tm = MarkovProcessProperties[estproc, "TransitionMatrix"][[ordering, ordering]]
TeXForm[TableForm[tm, TableHeadings -> {states[[ordering]], states[[ordering]]}]]
$\begin{array}{cccccc} & \text{A} & \text{B} & \text{C} & \text{D} & \text{E} \\ \text{A} & \frac{5}{9} & 0 & \frac{1}{9} & \frac{2}{9} & \frac{1}{9} \\ \text{B} & \frac{1}{3} & 0 & 0 & 0 & \frac{2}{3} \\ \text{C} & \frac{1}{5} & \frac{1}{5} & \frac{2}{5} & \frac{1}{5} & 0 \\ \text{D} & \frac{1}{8} & \frac{1}{4} & \frac{1}{8} & \frac{3}{8} & \frac{1}{8} \\ \text{E} & 0 & 0 & \frac{1}{5} & \frac{2}{5} & \frac{2}{5} \\ \end{array}$