The Commonest[] function doesn't actually show the commonest elements
You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".
Commonest[TextWords[txt], 10]
{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}
WordCloud[TextWords[txt]]
You can use Counts to get the counts of each word as well:
TakeLargest[Counts[TextWords[txt]], 20]
<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28, "crispness" -> 28, "knacker" -> 27, "validly" -> 27, "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27, "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26, "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26, "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26, "reconnoitering" -> 26|>
It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string. You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:
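A minimal sketch of that call, with txt being the original single-string input from the question:

WordCloud[txt, PreprocessingRules -> None]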
It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:
Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]
{"a", "about", "above", "across", "add-on", "after", "again", \ "against", "all", "almost", "alone", "along", "already", "also", \ "although", "always", "among", "an", "and", "another", "any", \ "anyone", "anything", "anywhere", "are", "around", "as", "at", \ "back", "back-to-back", "be", "because", "become", "before", \ "behind", "being", "below", "between", "born-again", "both", \ "built-in", "but", "by", "can-do", "custom-made", "do", "done", \ "down", "during", "each", "either", "enough", "even", "ever", \ "every", "everyone", "everything", "everywhere", "far-off", \ "far-out", "few", "find", "first", "for", "four", "from", "full", \ "further", "get", "give", "go", "have-not", "he", "head-on", "her", \ "here", "hers", "herself", "him", "himself", "his", "how", "however", \ "if", "in", "interest", "into", "it", "its", "itself", "keep", \ "laid-back", "last", "least", "less", "ma'am", "made", "man-made", \ "many", "may", "me", "might", "more", "most", "mostly", "much", \ "must", "my", "myself", "never", "next", "nobody", "nor", "no-show", \ "not", "nothing", "now", "nowhere", "of", "off", "often", "on", \ "once", "one", "only", "other", "our", "ours", "ourselves", "out", \ "over", "own", "part", "per", "perhaps", "put", "rather", \ "runner-up", "same", "seem", "seeming", "see-through", \ "self-interest", "self-made", "several", "she", "show", "side", \ "since", "sit-in", "so", "some", "someone", "something", "somewhere", \ "still", "such", "take", "than", "that", "the", "their", "theirs", \ "them", "themselves", "then", "there", "therefore", "these", "they", \ "this", "those", "though", "three", "through", "thus", "to", \ "together", "too", "toward", "two", "under", "until", "up", "upon", \ "us", "very", "we", "well", "well-to-do", "what", "when", "where", \ "where's", "whether", "which", "while", "who", "whole", "whom", \ "whose", "why", "will", "with", "within", "without", "would-be", \ "write-off", "yet", "you", "your", "yours", "yourself"}
I agree with some of those stopwords, but not really any of the ones that contain the "-" character. This is perhaps where the issue lies.
What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:
Counts[TextWords[txt]]["far"]
19
Counts[TextWords[DeleteStopwords[txt]]]["far"]
39
We can see that this behaviour is weird by comparing the following:
Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort
<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "far-off" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>
Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort
<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6|>
Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which TextWords then segments to "far", completely throwing off WordCloud's counting mechanism in this case.
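As a quick consistency check against the tallies above (using the same txt), the original counts for "far", "far-off" and "far-out" add up to the inflated figure for "far" after DeleteStopwords:

counts = Counts[TextWords[txt]];
counts["far"] + counts["far-off"] + counts["far-out"]
39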
As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:
words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];
AllTrue[words, StringQ]
True
Take[words, 10]
{"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",
"disgruntled", "ninja", "artificially", "seduce"}
Note that this import format already yielded a list of strings as opposed to a single string.
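For comparison, importing the same file with the "Text" element (presumably how txt was obtained above) yields one long string rather than a list:

txt = Import["https://pastebin.com/raw/Z0hd3huU", "Text"];
StringQ[txt]
True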
Commonest[words, 10]
{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",
"veronica", "crispness", "ashen"}
TakeLargest[Counts[words], 10]
<|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,
"squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,
"nematode" -> 27|>
WordCloud[words]
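If you want the weights to be explicit rather than inferred from repeated list elements, you should also be able to pass the counts directly, assuming WordCloud accepts an association of word -> weight (which is exactly what Counts produces):

WordCloud[Counts[words]]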