The Commonest[] function doesn't actually show the commonest elements
You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".
Commonest[TextWords[txt], 10]
{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}
WordCloud[TextWords[txt]]
You can use Counts to get the counts of each word as well:
TakeLargest[Counts[TextWords[txt]], 20]
<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28, "crispness" -> 28, "knacker" -> 27, "validly" -> 27, "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27, "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26, "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26, "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26, "reconnoitering" -> 26|>
It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string. You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:
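A minimal sketch of that call, with txt being the original single-string input from the question:

WordCloud[txt, PreprocessingRules -> None]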
It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:
Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]
{"a", "about", "above", "across", "add-on", "after", "again", \ "against", "all", "almost", "alone", "along", "already", "also", \ "although", "always", "among", "an", "and", "another", "any", \ "anyone", "anything", "anywhere", "are", "around", "as", "at", \ "back", "back-to-back", "be", "because", "become", "before", \ "behind", "being", "below", "between", "born-again", "both", \ "built-in", "but", "by", "can-do", "custom-made", "do", "done", \ "down", "during", "each", "either", "enough", "even", "ever", \ "every", "everyone", "everything", "everywhere", "far-off", \ "far-out", "few", "find", "first", "for", "four", "from", "full", \ "further", "get", "give", "go", "have-not", "he", "head-on", "her", \ "here", "hers", "herself", "him", "himself", "his", "how", "however", \ "if", "in", "interest", "into", "it", "its", "itself", "keep", \ "laid-back", "last", "least", "less", "ma'am", "made", "man-made", \ "many", "may", "me", "might", "more", "most", "mostly", "much", \ "must", "my", "myself", "never", "next", "nobody", "nor", "no-show", \ "not", "nothing", "now", "nowhere", "of", "off", "often", "on", \ "once", "one", "only", "other", "our", "ours", "ourselves", "out", \ "over", "own", "part", "per", "perhaps", "put", "rather", \ "runner-up", "same", "seem", "seeming", "see-through", \ "self-interest", "self-made", "several", "she", "show", "side", \ "since", "sit-in", "so", "some", "someone", "something", "somewhere", \ "still", "such", "take", "than", "that", "the", "their", "theirs", \ "them", "themselves", "then", "there", "therefore", "these", "they", \ "this", "those", "though", "three", "through", "thus", "to", \ "together", "too", "toward", "two", "under", "until", "up", "upon", \ "us", "very", "we", "well", "well-to-do", "what", "when", "where", \ "where's", "whether", "which", "while", "who", "whole", "whom", \ "whose", "why", "will", "with", "within", "without", "would-be", \ "write-off", "yet", "you", "your", "yours", "yourself"}
I agree with some of those stopwords, but not really any of the ones that contain the "-" character. This is perhaps where the issue lies.
What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:
Counts[TextWords[txt]]["far"]
19
Counts[TextWords[DeleteStopwords[txt]]]["far"]
39
We can see that this behaviour is weird by comparing the following:
Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort
<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "far-off" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>
Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort
<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6|>
Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which TextWords then segments to "far", completely throwing off WordCloud's counting mechanism in this case.
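As a quick consistency check against the tallies above (using the same txt), the original counts for "far", "far-off" and "far-out" add up to the inflated figure for "far" after DeleteStopwords:

counts = Counts[TextWords[txt]];
counts["far"] + counts["far-off"] + counts["far-out"]
39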
As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:
words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];
AllTrue[words, StringQ]
True
Take[words, 10]
{"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",
"disgruntled", "ninja", "artificially", "seduce"}
Note that this import format already yielded a list of strings as opposed to a single string.
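For comparison, importing the same file with the "Text" element (presumably how txt was obtained above) yields one long string rather than a list:

txt = Import["https://pastebin.com/raw/Z0hd3huU", "Text"];
StringQ[txt]
True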
Commonest[words, 10]
{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",
"veronica", "crispness", "ashen"}
TakeLargest[Counts[words], 10]
<|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,
"squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,
"nematode" -> 27|>
WordCloud[words]
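If you want the weights to be explicit rather than inferred from repeated list elements, you should also be able to pass the counts directly, assuming WordCloud accepts an association of word -> weight (which is exactly what Counts produces):

WordCloud[Counts[words]]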