Bigrams and TF-IDF calculation

What do you think about skipping the StringJoin and storing a bigram as a pair of strings?

getbigrams[text_String] := Module[{words},
  words = 
   StringSplit[
    ToLowerCase[StringDelete[text, PunctuationCharacter]]];
  Counts[Partition[words, 2, 1]]
  ]

On the ExampleData texts, that saves about 40 % of the run time:

data = ExampleData /@ ExampleData["Text"];

a = Table[
     Merge[<|First[#] <> " " <> Last[#] -> 1|> & /@ 
       Partition[
        StringSplit[
         StringReplace[ToLowerCase[data[[i]]], 
          PunctuationCharacter -> ""]], 2, 1], Total], {i, 1, 
      Length@data}]; // AbsoluteTiming // First

b = getbigrams /@ data; // AbsoluteTiming // First

Values[a] == Values[b]

7.62668

4.40748

True
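
To make the pair-of-strings representation concrete, here is a small sketch on a made-up sentence (the sample text and the name counts are only for illustration). It repeats the getbigrams definition from above so that it runs on its own, looks up one bigram by its pair of words, and shows that the joined string form can still be recovered afterwards if it is needed.

```mathematica
(* same definition as above, repeated so the snippet runs on its own *)
getbigrams[text_String] := Module[{words},
  words = StringSplit[
    ToLowerCase[StringDelete[text, PunctuationCharacter]]];
  Counts[Partition[words, 2, 1]]
  ]

(* illustrative sample text, not taken from ExampleData *)
counts = getbigrams["The cat sat. The cat ran."]
(* <|{"the", "cat"} -> 2, {"cat", "sat"} -> 1, {"sat", "the"} -> 1, {"cat", "ran"} -> 1|> *)

(* look up one bigram by its pair of words *)
counts[{"the", "cat"}]
(* 2 *)

(* join the pairs back into strings only where string keys are needed *)
KeyMap[StringRiffle, counts]
```

Joining the keys once at the end touches each distinct bigram a single time, whereas the original approach joined every occurrence.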


I've implemented the bigram and TF-IDF calculation following @Henrik Schumacher's suggestion (see the attached code). Thanks a lot, @Henrik Schumacher.

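 bigramUniqe = <||>;

 getbigrams[text_String] := Module[{words, res},
  words = StringSplit[StringDelete[text, PunctuationCharacter]];
  (* term frequency: how often each bigram occurs in this text *)
  res = Counts[Partition[words, 2, 1]];
  (* document frequency: count each distinct bigram once per text, *)
  (* so that Length@data/bigramUniqe[...] can never drop below 1 *)
  Scan[(If[MissingQ[bigramUniqe[#]], AssociateTo[bigramUniqe, # -> 1],
     AssociateTo[bigramUniqe, # -> (bigramUniqe[#] + 1)]]) &, Keys[res]];
  res
  ]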

bigram = TF (term frequency of each bigram, per text):

 bigram = Table[getbigrams[data[[i]]], {i, 1, Length@data}]; 

bigramUniqe = IDF (inverse document frequency of each bigram, from the counts collected above):

 KeyValueMap[#1 -> Log[2, Length@data/#2] &, bigramUniqe]
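
The post stops at the IDF values, so here is a minimal sketch of how the two pieces could be combined into TF-IDF weights. It is not the attached code: the three toy strings and the names docs, tf, df, idf and tfidf are assumptions made for illustration, and it uses the plain definition weight = TF * Log[2, N/DF], where DF is the number of texts containing the bigram.

```mathematica
(* toy corpus; these three strings are illustrative only *)
docs = {"the cat sat", "the cat ran", "a dog ran"};

(* term frequency per text, bigrams stored as pairs of words *)
tf = Counts[Partition[StringSplit[ToLowerCase[#]], 2, 1]] & /@ docs;

(* document frequency: number of texts that contain each bigram *)
df = Counts[Join @@ (Keys /@ tf)];

(* inverse document frequency, base-2 logarithm as in the post *)
idf = N[Log[2, Length[docs]/#]] & /@ df;

(* TF-IDF weight of every bigram in every text *)
tfidf = Association /@ (KeyValueMap[#1 -> #2*idf[#1] &] /@ tf)
```

Here {"the", "cat"} occurs in two of the three texts and therefore gets a smaller weight than the bigrams that occur in only one text.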