Bigrams and TF-IDF calculation
What do you think about skipping the StringJoin
and storing a bigram as a pair of strings?
getbigrams[text_String] := Module[{words},
words =
StringSplit[
ToLowerCase[StringDelete[text, PunctuationCharacter]]];
Counts[Partition[words, 2, 1]]
]
That cuts the run time by about 40 % on the example texts:
data = ExampleData /@ ExampleData["Text"];
a = Table[
Merge[<|First[#] <> " " <> Last[#] -> 1|> & /@
Partition[
StringSplit[
StringReplace[ToLowerCase[data[[i]]],
PunctuationCharacter -> ""]], 2, 1], Total], {i, 1,
Length@data}]; // AbsoluteTiming // First
b = getbigrams /@ data; // AbsoluteTiming // First
Values[a] == Values[b]
7.62668
4.40748
True
I've implemented the bigram and TF-IDF calculation following @Henrik Schumacher's suggestion (see the attached code). Thanks a lot, @Henrik Schumacher.
bigramUniqe = <||>
getbigrams[text_String] := Module[{words, res, temp},
words = StringSplit[StringDelete[text, PunctuationCharacter]];
temp = Partition[words, 2, 1];
res = Counts[temp];
(* count each bigram once per text, so that bigramUniqe holds document frequencies *)
Scan[(If[MissingQ[bigramUniqe[#]], AssociateTo[bigramUniqe, # -> 1],
AssociateTo[bigramUniqe, # -> (bigramUniqe[#] + 1)]]) &, Keys[res]];
res
]
bigram holds the term frequencies (TF), one association per text:
bigram = Table[getbigrams[data[[i]]], {i, 1, Length@data}];
The inverse document frequency (IDF) is computed from bigramUniqe:
KeyValueMap[#1 -> Log[2, Length@data/#2] &, bigramUniqe]
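For reference, the quantities computed above can be written out as follows. This is only a sketch of the usual definitions: it assumes that bigramUniqe[b] counts the number of texts containing the bigram b (written df(b) here) and that N = Length@data. The symbols tf, df, idf and the final product are notation introduced for illustration; the code in this post stops at the IDF and does not form the product itself.

```latex
\begin{align*}
\mathrm{tf}(b, d) &= \text{number of occurrences of bigram } b \text{ in text } d, \\
\mathrm{idf}(b) &= \log_2 \frac{N}{\mathrm{df}(b)}, \\
\mathrm{tfidf}(b, d) &= \mathrm{tf}(b, d) \cdot \mathrm{idf}(b).
\end{align*}
```

With these definitions, a bigram that occurs in every text has idf(b) = 0, and since df(b) <= N the IDF is never negative.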