How to create word clouds?

Here's what I came up with

Mathematica graphics

How I did it

First we need a list of words. Here, I've taken the original list ordered by size.

tally = Tally@
   Cases[StringSplit[ExampleData[{"Text", "AliceInWonderland"}], 
     Except@LetterCharacter], _?(StringLength@# > 4 \[And] # =!= 
         "Alice" &)];
tally = Cases[tally, _?(Last@# > 10 &)];
tally = Reverse@SortBy[tally, Last];
range = {Min@(Last /@ tally), Max@(Last /@ tally)};

words = Style[First@#, FontFamily -> "Cracked", FontWeight -> Bold, 
     FontColor -> 
      Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]], 
     FontSize -> Last@Rescale[#, range, {12, 70}]] & /@ tally;

The words are rasterised and cropped to make sure the bounding box is as tight as possible.

Click to copy

wordsimg = ImageCrop[Image[Graphics[Text[#]]]] & /@ words;

To produce the image the words are added one by one using a Fold loop where the next word is placed as close to the centre of the existing image as possible. This is done by applying a max filter to the binarized version of the original image thus turning forbidden pixels white and looking for the black point that is closest to the centre of the image.

Click to copy

iteration[img1_, w_, fun_: (Norm[#1 - #2] &)] := 
 Module[{imdil, centre, diff, dimw, padding, padded1, minpos},
  dimw = ImageDimensions[w];
  padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
  
  imdil = MaxFilter[Binarize[ColorNegate[padded1], 0.01], 
    Reverse@Floor[dimw/2 + 2]];
  centre = ImageDimensions[padded1]/2;
  
  minpos = Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0], 
      Reverse[centre], DistanceFunction -> fun][[1]];
  diff = ImageDimensions[imdil] - dimw;
  padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
  
  ImagePad[#, (-Min[#] {1, 1 }) & /@ BorderDimensions[#]] &@
   ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]

Fold[iteration, wordsimg[[1]], Rest[wordsimg]]

You can play around with the distance function. For example for a distance function

Click to copy

fun = Norm[{1, 1/2} (#2 - #1)] &

you get an ellipsoidal shape:

Click to copy

Fold[iteration[##, fun]&, wordsimg[[1]], Rest[wordsimg]]

Mathematica graphics

Updated version

The previous code places new words in the image by approximating them with rectangles. This works fine for horizontally or vertically oriented words, but not so well for rotated words or more general shapes. Luckily, the code can be easily modified to deal with this by replacing the MaxFilter with a ImageCorrelate:

Click to copy

iteration2[img1_, w_, fun_: ( Norm[#1 - #2] &)] := 
 Module[{imdil, centre, diff, dimw, padding, padded1, minpos}, 
  dimw = ImageDimensions[w];
  padded1 = ImagePad[img1, {dimw[[1]] {1, 1}, dimw[[2]] {1, 1}}, 1];
  imdil = Binarize[ImageCorrelate[Binarize[ColorNegate[padded1], 0.05], 
     Dilation[Binarize[ColorNegate[w], .05], 1]]];
  centre = ImageDimensions[padded1]/2;
  minpos = 
   Reverse@Nearest[Position[Reverse[ImageData[imdil]], 0], 
      Reverse[centre], DistanceFunction -> fun][[1]];
  Sow[minpos - centre]; (* for creating vector plot *)
  diff = ImageDimensions[imdil] - dimw;
  padding[pos_] := Transpose[{#, diff - #} &@Round[pos - dimw/2]];
  ImagePad[#, (-Min[#] {1, 1}) & /@ BorderDimensions[#]] &@
   ImageMultiply[padded1, ImagePad[w, padding[minpos], 1]]]

To test this code we use a list of rotated words. Note that I'm using ImagePad instead of ImageCrop to crop the images. This is because ImageCrop seems to clip the words sometimes.

Click to copy

words = Style[First@#, FontFamily -> "Times", 
     FontColor -> 
      Hue[RandomReal[], RandomReal[{.5, 1}], RandomReal[{.5, 1}]], 
     FontSize -> (Last@Rescale[#, range, {12, 150}])] & /@ tally;

wordsimg = ImagePad[#, -3 - 
  BorderDimensions[#]] & /@ (Image[
   Graphics[Text[Framed[#, FrameMargins -> 2]]]] & /@ words);

wordsimgRot = ImageRotate[#, RandomReal[2 Pi], 
  Background -> White] & /@ wordsimg;

The iteration loop is as before:

Click to copy

Fold[iteration2, wordsimgRot[[1]], Rest[wordsimgRot]]

which produces

Mathematica graphics

Second update

To create a vector graphics of the previous result, we need to save the positions of the words in the image, for example by adding Sow[minpos - centre] to the definition of iteration2 somewhere towards the end of the code and using Reap to reap the results. We also need to keep the rotation angles of the words, so we'll replace wordsimgRot with

Click to copy

angles = RandomReal[2 Pi, Length[wordsimg]];

wordsimgRot = ImageRotate[##, Background -> White] & @@@ 
   Transpose[{wordsimg, angles}];

As mentioned before, we use Reap to create the position list

Click to copy

poslist = Reap[img = Fold[iteration2, wordsimgRot[[1]], 
  Rest[wordsimgRot]];][[2, 1]]

The vector graphics can then be created with

Click to copy

Graphics[MapThread[Text[#1, Offset[#2, {0, 0}], {0, 0}, {Cos[#3], Sin[#3]}] &,
  {words, Prepend[poslist, {0, 0}], angles}]]

A preview

Before I show any code, here's a preview of what is possible with some tweaking:

Mathematica graphics

First try

Here's a go at implementing Wordle's layout algorithm, described at cormullion's link.

First, let's generate the word data (this is pretty arbitrary):

Click to copy

punctuation = ",/.<>?;':\"()-_!&"

(* boring words: *)
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his", 
   "he", "it", "with", "is", "for", "as", "had", "you", "not", "be", 
   "her", "on", "at", "by", "which", "have", "or", "from", "this", 
   "him", "but", "all", "she", "they", "were", "my", "are", "me", 
   "one", "their", "so", "an", "said", "them", "we", "who", "would", 
   "been", "will", "no", "when", "there", "if", "more", "out", "up", 
   "into", "do", "any", "your", "what", "has", "man", "could", 
   "other", "than", "our", "some", "very", "time", "upon", "about", 
   "may", "its", "only", "now", "like", "little", "then", "can", 
   "should", "made", "did", "us", "such", "a", "great", "before", 
   "must", "two", "these", "see", "know", "over", "much", "down", 
   "after", "first", "mr", "good", "men"};

text = Select[
   StringSplit@
    StringReplace[ExampleData[{"Text", "AliceInWonderland"}], 
     Alternatives @@ Characters[punctuation] -> " "],
   StringLength[#] > 2 &
   ];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];

Now that we have the data, let's take a word tally and generate words at sized proportional to their frequency:

Click to copy

words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];

styledwords = 
 Style[#1, FontSize -> #2, FontFamily -> "Times"] & @@@ words

Mathematica graphics

Let's rasterize and binarize these (the binarization is to ease overlap detection):

Click to copy

images = Binarize@Rasterize[#, "Image"] & /@ styledwords;

This counts black pixels in an image:

Click to copy

count[img_] := ImageLevels[Binarize[img]][[1, 2]]

Now run this:

Click to copy

canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
Monitor[
 Do[
  x = 0;
  w = images[[i]];
  cc = count[canvas];
  centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
  compose := 
   res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]], 
     Scaled[centre + x/100 {Cos[x], Sin[x]}]];
  compose;
  While[count[res] - cc - count[w] < 0 && x < 80,
   x += 1;
   compose;
   ];
  canvas = res,
  
  {i, 1, Length[images]}
  ],
 canvas
 ]

canvas // ImageCrop

Mathematica graphics

Some explanations:

This code takes the word images one by one and tries to place them at a random position. If it does not fit at that position, it tries other positions, moving on an outward winding Archimedean spiral. The step size of moving on this spiral should be chosen so that the points are distributed with an approximately constant density in the plane, and are not on gathered on a few thin lines. I used this code to verify the point distribution for a given step size (1 was good enough for a first try):

Click to copy

Manipulate[Graphics[Point@Table[x/100 {Cos[x], Sin[x]}, {x, 0, 100, s}]], 
   {s, 0.1, 1.5}]

Testing for image overlap (i.e. whether the word fits) is done by composing the image onto a canvas which has all the previously placed words, and verifying that no black pixels will collide (i.e. the total pixel count will not change after placing the image). There are probably faster ways to do this.

Second try: converting all this to vector graphics

I used images but his should only be an aid for calculating positions. The images should be dilated prior to arranging them (to have more "air" between them), their positions should be recorded, and the recorded positions should be used to arrange the vector versions of words.

Here's a basic implementation:

Click to copy

images = ColorNegate@Dilation[ColorNegate[#], 2] & /@ images

canvas = Image[Graphics[], ImageSize -> {1000, 1000}];
positions = {};
Monitor[
 Do[
  x = 0;
  w = images[[i]];
  cc = count[canvas];
  centre = RandomReal[0.1 {-1, 1}, {2}] + {0.5, 0.5};
  compose := 
   res = ImageCompose[canvas, SetAlphaChannel[w, ColorNegate[w]], 
     pos = Scaled[centre + x/100 {Cos[x], Sin[x]}]];
  compose;
  While[count[res] - cc - count[w] < 0 && x < 80,
   x += 1;
   compose;
   ];
  canvas = res;
  AppendTo[positions, pos],
  
  {i, 1, Length[images]}
  ],
 canvas
 ]

Rasterize[
  Graphics[MapThread[Text, {styledwords, positions}], 
   ImageSize -> 1000], "Image"] // ImageCrop

Mathematica graphics

There's a lot of refinement needed though (I didn't have time to tune the parameters).

Now that two of our resident Mathematica geniuses (genii?) have produced such awesome examples, there's not much room left for anyone else... :) But that didn't stop me - and I'm here to make you guys look good. I had an idea...

I decided not to make a cloud, but a tale - or rather, a tail. I've pinched Szabolcs's code to get the words and frequencies:

Click to copy

punctuation = ",/.<>?;':\"()-_!&";
common = {"the", "of", "and", "to", "in", "I", "that", "was", "his", 
 "he", "it", "with", "is", "for", "as", "had", "you", "not", "be", 
  "her", "on", "at", "by", "which", "have", "or", "from", "this", 
  "him", "but", "all", "she", "they", "were", "my", "are", "me", 
  "one", "their", "so", "an", "said", "them", "we", "who", "would", 
  "been", "will", "no", "when", "there", "if", "more", "out", "up", 
  "into", "do", "any", "your", "what", "has", "man", "could", "other",
   "than", "our", "some", "very", "time", "upon", "about", "may", 
  "its", "only", "now", "like", "little", "then", "can", "should", 
  "made", "did", "us", "such", "a", "great", "before", "must", "two", 
  "these", "see", "know", "over", "much", "down", "after", "first", 
  "mr", "good", "men"}; text = 
 Select[StringSplit@
   StringReplace[ExampleData[{"Text", "AliceInWonderland"}], 
    Alternatives @@ Characters[punctuation] -> " "], 
  StringLength[#] > 2 &];
text = DeleteCases[text, w_ /; MemberQ[common, ToLowerCase[w]]];
words = TakeWhile[Reverse@SortBy[Tally[text], Last], #[[2]] >= 10 &];

Now, I can work out how big the words would be:

Click to copy

textToOutline[string_, size_] := 
  First[ImportString[
    ExportString[
     Style[string, FontFamily -> "Times", FontWeight -> "Bold", 
      FontSize -> size], "PDF"], "TextMode" -> "Outlines"]];

width[t_] := t[[2]][[2]][[1]]; (* the best way? - surely not! *)
height[t_] := t[[2]][[2]][[2]];

and make the graphics:

Click to copy

wordGraphics =  textToOutline[First[#], Last[#]] & /@ words;
baselines =   Reverse[Accumulate [Reverse[height /@ wordGraphics]]]; (* !? *)
wiggle = 0; fudgeFactor = 180; 
Graphics[
 MapThread[
   Tooltip[Text[#1, {Sin[++wiggle/4], #2/fudgeFactor}], #3] &, 
   {wordGraphics,  baselines, words}], 
 Background -> LightGreen]

With the help of a fudge factor or two, it looks like this:

word tail

There's a number of problems with this code - I don't see how to do some things in Mathematica yet - such as find out how far I've got in a Map operation, or how the coordinates work (what is that fudge factor doing?), or, more importantly, why the first word doesn't look right. But I like the idea of reading the tale.

See The Mouse's tale - Wikipedia

How to create word clouds?

Updated version

Second update

A preview

First try

Second try: converting all this to vector graphics

Tags:

Graphics

Computational Geometry

Packing

Word Cloud

Generative Art

Related

Recent Posts