Number of capital letters

Using rhermans' testtext, I would propose this:

HammingDistance[testtext, ToLowerCase[testtext]] // RepeatedTiming

{0.00025, 185}

(Update: ) Here is another one:

uppercase = Alternatives @@ CharacterRange["A", "Z"];
StringCount[testtext, uppercase] // RepeatedTiming

{0.00066, 185}

Using Length@StringCases instead of StringCount takes about the same amount of time, 0.00068 in my experiment.

Also, Mr. Wizard mentioned using regular expressions, which can be made fast in the same way (which is not surprising because string patterns are internally converted to regular expressions):

regex = RegularExpression@First@StringPattern`PatternConvert[uppercase];
StringCount[testtext, regex] // RepeatedTiming

{0.00067, 185}

However, on my computer this is only as fast as using the more appropriate regular expression suggested by Mr. Wizard, it is neither better nor worse.

We can compare this to rhermans' approach:

RepeatedTiming[StringCount[testtext, _?UpperCaseQ]]

{0.00336, 185}

Both the HammingDistance approach and the StringCases approach are an order of magnitude faster.

It seems to me you are counting the number of different letters that apear as uppercase and not the total number of uppercase characters present. Here I use the UN Declaration of Human Rights without PunctuationCharacter as sample text. I use RepeatedTiming to show the average time in seconds used to compute, together with the result obtained.

Test data

testtext = StringDelete[
   ExampleData[{"Text", "UNHumanRightsEnglish"}]
   , PunctuationCharacter];

Your approach

RepeatedTiming[
 Total[Boole[UpperCaseQ /@ Keys[LetterCounts[testtext]]]]
 ]
(* {0.0564, 21} *)

What I think you intended

RepeatedTiming[
 Total@KeySelect[LetterCounts[testtext], UpperCaseQ]
 ]
(* {0.059, 185} *)

My approach

RepeatedTiming[
 StringCount[testtext, _?UpperCaseQ]
 ]
(* {0.0032, 185} *)

@C.E. offers the best answer so far, using HammingDistance. Using a similar approach but with EditDistance it's the slowest of all.

BarChart[
 First /@ {
    RepeatedTiming[(* "HammingDistance" @C.E *)
     HammingDistance[testtext, ToLowerCase[testtext]]
     ],
    RepeatedTiming[(* "StringCount Alternatives CharacterRange" @C.E *)


     StringCount[testtext, Alternatives @@ CharacterRange["A", "Z"]]
     ],
    RepeatedTiming[(* "StringCount RegularExpression" @Mr.Wizard *)
     StringCount[testtext, RegularExpression["\\p{Lu}"]]
     ],
    RepeatedTiming[ (* "StringCount UpperCaseQ" @rhermans *)
     StringCount[testtext, _?UpperCaseQ]
     ],
    RepeatedTiming[(* "Length@StringCases" *)
     Length@StringCases[testtext, _?UpperCaseQ]
     ],
    RepeatedTiming[(* "Total KeySelect LetterCounts" @
     KirilDanilchenko *)
     Total@KeySelect[LetterCounts[testtext], UpperCaseQ]
     ],
    RepeatedTiming[(* "EditDistance" @rhermans *)
     EditDistance[testtext, ToLowerCase[testtext]]
     ]
    } 10^6
 , ChartLabels -> {
   "HammingDistance\[email protected]",
   "StringCount\nAlternatives\nCharacterRange\[email protected]",
   "StringCount\nRegularExpression\[email protected] ",
   "StringCount\nUpperCaseQ\n@rhermans",
   "Length@StringCases",
   "Total\nKeySelect\nLetterCounts\n@KirilDanilchenko",
   "EditDistance\n@rhermans"
   }
 , PlotTheme -> "Scientific"
 , AspectRatio -> 1/2
 , ImageSize -> 600
 , FrameLabel -> {None, "Time \[Mu]s"}
 , BarSpacing -> Large
 , ScalingFunctions -> "Log"
 ]

Mathematica graphics

On my system (v10.1 under Windows) RegularExpression is not quite as fast as HammingDistance but arguably it is more flexible, and much faster than UpperCaseQ

text = ExampleData[{"Text", "DeclarationOfIndependence"}];

StringCount[text, RegularExpression["\\p{Lu}"]]  // RepeatedTiming

{0.000420, 342}

cf.

StringCount[text, _?UpperCaseQ]                  // RepeatedTiming
HammingDistance[text, ToLowerCase[text]]         // RepeatedTiming

{0.00573, 342}
{0.000300, 342}

Number of capital letters

Tags:

Benchmarking

String Manipulation

Natural Language

Related

Recent Posts