Number of capital letters
Using rhermans' testtext
, I would propose this:
HammingDistance[testtext, ToLowerCase[testtext]] // RepeatedTiming
{0.00025, 185}
(Update: ) Here is another one:
uppercase = Alternatives @@ CharacterRange["A", "Z"];
StringCount[testtext, uppercase] // RepeatedTiming
{0.00066, 185}
Using Length@StringCases
instead of StringCount
takes about the same amount of time, 0.00068
in my experiment.
Also, Mr. Wizard mentioned using regular expressions, which can be made fast in the same way (which is not surprising because string patterns are internally converted to regular expressions):
regex = RegularExpression@First@StringPattern`PatternConvert[uppercase];
StringCount[testtext, regex] // RepeatedTiming
{0.00067, 185}
However, on my computer this is only as fast as using the more appropriate regular expression suggested by Mr. Wizard, it is neither better nor worse.
We can compare this to rhermans' approach:
RepeatedTiming[StringCount[testtext, _?UpperCaseQ]]
{0.00336, 185}
Both the HammingDistance
approach and the StringCases
approach are an order of magnitude faster.
It seems to me you are counting the number of different letters that apear as uppercase and not the total number of uppercase characters present. Here I use the UN Declaration of Human Rights without PunctuationCharacter
as sample text. I use RepeatedTiming
to show the average time in seconds used to compute, together with the result obtained.
Test data
testtext = StringDelete[
ExampleData[{"Text", "UNHumanRightsEnglish"}]
, PunctuationCharacter];
Your approach
RepeatedTiming[
Total[Boole[UpperCaseQ /@ Keys[LetterCounts[testtext]]]]
]
(* {0.0564, 21} *)
What I think you intended
RepeatedTiming[
Total@KeySelect[LetterCounts[testtext], UpperCaseQ]
]
(* {0.059, 185} *)
My approach
RepeatedTiming[
StringCount[testtext, _?UpperCaseQ]
]
(* {0.0032, 185} *)
@C.E. offers the best answer so far, using HammingDistance
. Using a similar approach but with EditDistance
it's the slowest of all.
BarChart[
First /@ {
RepeatedTiming[(* "HammingDistance" @C.E *)
HammingDistance[testtext, ToLowerCase[testtext]]
],
RepeatedTiming[(* "StringCount Alternatives CharacterRange" @C.E *)
StringCount[testtext, Alternatives @@ CharacterRange["A", "Z"]]
],
RepeatedTiming[(* "StringCount RegularExpression" @Mr.Wizard *)
StringCount[testtext, RegularExpression["\\p{Lu}"]]
],
RepeatedTiming[ (* "StringCount UpperCaseQ" @rhermans *)
StringCount[testtext, _?UpperCaseQ]
],
RepeatedTiming[(* "Length@StringCases" *)
Length@StringCases[testtext, _?UpperCaseQ]
],
RepeatedTiming[(* "Total KeySelect LetterCounts" @
KirilDanilchenko *)
Total@KeySelect[LetterCounts[testtext], UpperCaseQ]
],
RepeatedTiming[(* "EditDistance" @rhermans *)
EditDistance[testtext, ToLowerCase[testtext]]
]
} 10^6
, ChartLabels -> {
"HammingDistance\[email protected]",
"StringCount\nAlternatives\nCharacterRange\[email protected]",
"StringCount\nRegularExpression\[email protected] ",
"StringCount\nUpperCaseQ\n@rhermans",
"Length@StringCases",
"Total\nKeySelect\nLetterCounts\n@KirilDanilchenko",
"EditDistance\n@rhermans"
}
, PlotTheme -> "Scientific"
, AspectRatio -> 1/2
, ImageSize -> 600
, FrameLabel -> {None, "Time \[Mu]s"}
, BarSpacing -> Large
, ScalingFunctions -> "Log"
]
On my system (v10.1 under Windows) RegularExpression
is not quite as fast as HammingDistance
but arguably it is more flexible, and much faster than UpperCaseQ
text = ExampleData[{"Text", "DeclarationOfIndependence"}];
StringCount[text, RegularExpression["\\p{Lu}"]] // RepeatedTiming
{0.000420, 342}
cf.
StringCount[text, _?UpperCaseQ] // RepeatedTiming
HammingDistance[text, ToLowerCase[text]] // RepeatedTiming
{0.00573, 342} {0.000300, 342}