How to create word clouds?
Here's what I came up with
How I did it
First we need a list of words. Here, I've taken the original list ordered by size.
tally = Tally@
Cases[StringSplit[ExampleData[{"Text", "AliceInWonderland"}],
Except@LetterCharacter], _?(StringLength@# > 4 \[And] # =!=
"Alice" &)];
tally = Cases[tally, _?(Last@# > 10 &)];
tally = Reverse@SortBy[tally, Last];
range = {Min@(Last /@ tally), Max@(Last /@ tally)};
words = Style[First@#, FontFamily -> "Cracked", FontWeight -> Bold,
FontColor ->
Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]],
FontSize -> Last@Rescale[#, range, {12, 70}]] & /@ tally;
The words are rasterised and cropped to make sure the bounding box is as tight as possible.
wordsimg = ImageCrop[Image[Graphics[Text[#]]]] & /@ words;
To produce the image the words are added one by one using a Fold
loop where the next word is placed as close to the centre of the existing image as possible. This is done by applying a max filter to the binarized version of the original image thus turning forbidden pixels white and looking for the black point that is closest to the centre of the image.
iteration[img1_, w_, fun_: (Norm[#1 - #2] &)] :=
Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
dimw = ImageDimensions[w];
padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
imdil = MaxFilter[Binarize[ColorNegate[padded1], 0.01],
Reverse@Floor[dimw/2 + 2]];
centre = ImageDimensions[padded1]/2;
minpos = Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0],
Reverse[centre], DistanceFunction -> fun][[1]];
diff = ImageDimensions[imdil] - dimw;
padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
ImagePad[#, (-Min[#] {1, 1 }) & /@ BorderDimensions[#]] &@
ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]
Fold[iteration, wordsimg[[1]], Rest[wordsimg]]
You can play around with the distance function. For example for a distance function
fun = Norm[{1, 1/2} (#2 - #1)] &
you get an ellipsoidal shape:
Fold[iteration[##, fun]&, wordsimg[[1]], Rest[wordsimg]]
Updated version
The previous code places new words in the image by approximating them with rectangles. This works fine for horizontally or vertically oriented words, but not so well for rotated words or more general shapes. Luckily, the code can be easily modified to deal with this by replacing the MaxFilter
with a ImageCorrelate
:
iteration2[img1_, w_, fun_: ( Norm[#1 - #2] &)] :=
Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
dimw = ImageDimensions[w];
padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
imdil = Binarize[ImageCorrelate[Binarize[ColorNegate[padded1], 0.05],
Dilation[Binarize[ColorNegate[w], .05], 1]]];
centre = ImageDimensions[padded1]/2;
minpos =
Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0],
Reverse[centre], DistanceFunction -> fun][[1]];
Sow[minpos - centre]; (* for creating vector plot *)
diff = ImageDimensions[imdil] - dimw;
padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
ImagePad[#, (-Min[#] {1, 1}) & /@ BorderDimensions[#]] &@
ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]
To test this code we use a list of rotated words. Note that I'm using ImagePad
instead of ImageCrop
to crop the images. This is because ImageCrop
seems to clip the words sometimes.
words = Style[First@#, FontFamily -> "Times",
FontColor ->
Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]],
FontSize -> (Last@Rescale[#, range, {12, 150}])] & /@ tally;
wordsimg = ImagePad[#, -3 -
BorderDimensions[#]] & /@ (Image[
Graphics[Text[Framed[#, FrameMargins -> 2]]]] & /@ words);
wordsimgRot = ImageRotate[#, RandomReal[2 Pi],
Background -> White] & /@ wordsimg;
The iteration loop is as before:
Fold[iteration2, wordsimgRot[[1]], Rest[wordsimgRot]]
which produces
Second update
To create a vector graphics of the previous result, we need to save the positions of the words in the image, for example by adding Sow[minpos - centre]
to the definition of iteration2
somewhere towards the end of the code and using Reap
to reap the results. We also need to keep the rotation angles of the words, so we'll replace wordsimgRot
with
angles = RandomReal[2 Pi, Length[wordsimg]];
wordsimgRot = ImageRotate[##, Background -> White] & @@@
Transpose[{wordsimg, angles}];
As mentioned before, we use Reap
to create the position list
poslist = Reap[img = Fold[iteration2, wordsimgRot[[1]],
Rest[wordsimgRot]];][[2, 1]]
The vector graphics can then be created with
Graphics[MapThread[Text[#1, Offset[#2, {0, 0}], {0, 0}, {Cos[#3], Sin[#3]}] &,
{words, Prepend[poslist, {0, 0}], angles}]]
A preview
Before I show any code, here's a preview of what is possible with some tweaking:
First try
Here's a go at implementing Wordle's layout algorithm, described at cormullion's link.
First, let's generate the word data (this is pretty arbitrary):
punctuation = ",/.<>?;':\"()-_!&"
(* boring words: *)
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his",
"he", "it", "with", "is", "for", "as", "had", "you", "not", "be",
"her", "on", "at", "by", "which", "have", "or", "from", "this",
"him", "but", "all", "she", "they", "were", "my", "are", "me",
"one", "their", "so", "an", "said", "them", "we", "who", "would",
"been", "will", "no", "when", "there", "if", "more", "out", "up",
"into", "do", "any", "your", "what", "has", "man", "could",
"other", "than", "our", "some", "very", "time", "upon", "about",
"may", "its", "only", "now", "like", "little", "then", "can",
"should", "made", "did", "us", "such", "a", "great", "before",
"must", "two", "these", "see", "know", "over", "much", "down",
"after", "first", "mr", "good", "men"};
text = Select[
StringSplit@
StringReplace[ExampleData[{"Text", "AliceInWonderland"}],
Alternatives @@ Characters[punctuation] -> " "],
StringLength[#] > 2 &
];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];
Now that we have the data, let's take a word tally and generate words at sized proportional to their frequency:
words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];
styledwords =
Style[#1, FontSize -> #2, FontFamily -> "Times"] & @@@ words
Let's rasterize and binarize these (the binarization is to ease overlap detection):
images = Binarize@Rasterize[#, "Image"] & /@ styledwords;
This counts black pixels in an image:
count[img_] := ImageLevels[Binarize[img]][[1, 2]]
Now run this:
canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
Monitor[
Do[
x = 0;
w = images[[i]];
cc = count[canvas];
centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
compose :=
res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]],
Scaled[centre + x/100 {Cos[x], Sin[x]}]];
compose;
While[count[res] - cc - count[w] < 0 && x < 80,
x += 1;
compose;
];
canvas = res,
{i, 1, Length[images]}
],
canvas
]
canvas // ImageCrop
Some explanations:
This code takes the word images one by one and tries to place them at a random position. If it does not fit at that position, it tries other positions, moving on an outward winding Archimedean spiral. The step size of moving on this spiral should be chosen so that the points are distributed with an approximately constant density in the plane, and are not on gathered on a few thin lines. I used this code to verify the point distribution for a given step size (1 was good enough for a first try):
Manipulate[Graphics[Point@Table[x/100 {Cos[x], Sin[x]}, {x, 0, 100, s}]],
{s, 0.1, 1.5}]
Testing for image overlap (i.e. whether the word fits) is done by composing the image onto a canvas which has all the previously placed words, and verifying that no black pixels will collide (i.e. the total pixel count will not change after placing the image). There are probably faster ways to do this.
Second try: converting all this to vector graphics
I used images but his should only be an aid for calculating positions. The images should be dilated prior to arranging them (to have more "air" between them), their positions should be recorded, and the recorded positions should be used to arrange the vector versions of words.
Here's a basic implementation:
images = ColorNegate@Dilation[ColorNegate[#], 2] & /@ images
canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
positions = {};
Monitor[
Do[
x = 0;
w = images[[i]];
cc = count[canvas];
centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
compose :=
res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]],
pos = Scaled[centre + x/100 {Cos[x], Sin[x]}]];
compose;
While[count[res] - cc - count[w] < 0 && x < 80,
x += 1;
compose;
];
canvas = res;
AppendTo[positions, pos],
{i, 1, Length[images]}
],
canvas
]
Rasterize[
Graphics[MapThread[Text, {styledwords, positions}],
ImageSize -> 1000], "Image"] // ImageCrop
There's a lot of refinement needed though (I didn't have time to tune the parameters).
Now that two of our resident Mathematica geniuses (genii?) have produced such awesome examples, there's not much room left for anyone else... :) But that didn't stop me - and I'm here to make you guys look good. I had an idea...
I decided not to make a cloud, but a tale - or rather, a tail. I've pinched Szabolcs's code to get the words and frequencies:
punctuation = ",/.<>?;':\"()-_!&";
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his",
"he", "it", "with", "is", "for", "as", "had", "you", "not", "be",
"her", "on", "at", "by", "which", "have", "or", "from", "this",
"him", "but", "all", "she", "they", "were", "my", "are", "me",
"one", "their", "so", "an", "said", "them", "we", "who", "would",
"been", "will", "no", "when", "there", "if", "more", "out", "up",
"into", "do", "any", "your", "what", "has", "man", "could", "other",
"than", "our", "some", "very", "time", "upon", "about", "may",
"its", "only", "now", "like", "little", "then", "can", "should",
"made", "did", "us", "such", "a", "great", "before", "must", "two",
"these", "see", "know", "over", "much", "down", "after", "first",
"mr", "good", "men"}; text =
Select[StringSplit@
StringReplace[ExampleData[{"Text", "AliceInWonderland"}],
Alternatives @@ Characters[punctuation] -> " "],
StringLength[#] > 2 &];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];
words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];
Now, I can work out how big the words would be:
textToOutline[string_, size_] :=
First[ImportString[
ExportString[
Style[string, FontFamily -> "Times", FontWeight -> "Bold",
FontSize -> size], "PDF"], "TextMode" -> "Outlines"]];
width[t_] := t[[2]][[2]][[1]]; (* the best way? - surely not! *)
height[t_] := t[[2]][[2]][[2]];
and make the graphics:
wordGraphics = textToOutline[First[#], Last[#]] & /@ words;
baselines = Reverse[Accumulate [Reverse[height /@ wordGraphics]]]; (* !? *)
wiggle = 0; fudgeFactor = 180;
Graphics[
MapThread[
Tooltip[Text[#1, {Sin[++wiggle/4], #2/fudgeFactor}], #3] &,
{wordGraphics, baselines, words}],
Background -> LightGreen]
With the help of a fudge factor or two, it looks like this:
There's a number of problems with this code - I don't see how to do some things in Mathematica yet - such as find out how far I've got in a Map
operation, or how the coordinates work (what is that fudge factor doing?), or, more importantly, why the first word doesn't look right. But I like the idea of reading the tale.
See The Mouse's tale - Wikipedia