Space-Efficient Data Structure for Storing a Word List?

One structure I saw for minimizing space in a spelling dictionary was to encode each word as:

  • the number of characters (a byte) in common with the last; and
  • the new ending.

So the word list

HERE            would encode as    THIS
sanctimonious                      0,sanctimonious
sanction                           6,on
sanguine                           3,guine
trivial                            0,trivial

You're saving 7 bytes straight up there (19%), I suspect the saving would be similar for a 20,000 word dictionary just due to the minimum distances between (common prefixes of) adjacent words.

For example, I ran a test program over a sorted dictionary and calculated old storage space as the word length plus one (for terminator), and the new storage space as one for the common length, the uncommon suffix length and one for a terminator. Here's the final part of that test program showing that you could well over 50%:

zwiebacks   -> zygote      common=        old=1044662 new=469762 55.0%
zygote      -> zygotes     common=zygote  old=1044670 new=469765 55.0%
zygotes     -> zygotic     common=zygot   old=1044678 new=469769 55.0%
zygotic     -> zymase      common=zy      old=1044685 new=469775 55.0%
zymase      -> zymogenic   common=zym     old=1044695 new=469783 55.0%
zymogenic   -> zymology    common=zymo    old=1044704 new=469789 55.0%
zymology    -> zymolysis   common=zymol   old=1044714 new=469795 55.0%
zymolysis   -> zymoplastic common=zymo    old=1044726 new=469804 55.0%
zymoplastic -> zymoscope   common=zymo    old=1044736 new=469811 55.0%
zymoscope   -> zymurgy     common=zym     old=1044744 new=469817 55.0%
zymurgy     -> zyzzyva     common=zy      old=1044752 new=469824 55.0%
zyzzyva     -> zyzzyvas    common=zyzzyva old=1044761 new=469827 55.0%

To speed lookup, there was a 26-entry table in memory which held the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as the first byte as they had no letters in common with the previous word.

This seems to be sort of a trie but without the pointers, which would surely get space-expensive if every character in the tree had a 4-byte pointer associated with it.

Mind you, this was from my CP/M days where memory was much scarcer than it is now.


A Patricia trie may be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me there were used in some of the early full-text search engines ...

Paul.


What are you doing? If it's spell checking, you could use a bloom filter - see this code kata.