Detect most likely words from text without spaces / combined words

Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in xrange(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder
    cache[text] = (best_freq, best_split)
    return cache[text]

print split_text('filesaveas', WORD_FREQUENCIES, {})

--> (1.3653e-08, ['file', 'save', 'as'])

I don't know of any library for it, but it shouldn't be hard to implement basic functionality.

  1. Get a words list, like UNIX's words.
  2. Stuff the contents of your word list into a trie.
  3. Take the string you want to split and follow its path in the trie. Each time you reach a valid word, create a new branch that searches for a word starting from the point of the string that you got to. Once you finish your current branch, backtrack to the one you created, like in a depth first search.
  4. Disambiguate the resulting lists manually, using heuristics or through a natural language parser.

Example:

  1. Word: "filesaveasstring"
  2. First valid word is "file". Try matching "saveas". First valid word is "save". Try matching "asstring". First valid word is "as". Try matching "string". First valid word is "string". Matched until end; put the [file save as string] into your results list.
  3. Backtrack to matching "string" - no other possibilities. Backtrack to "asstring". First unvisited valid word is "ass". Try matching "tring". No possible matches. Backtrack to "asstring". No possible matches. Backtrack to "filesaveasstring".
  4. First unvisited match is "files". Try to match "aveasstring". First match is "ave". Try matching "asstring" (same results as steps 2/3), adding [files ave as string] to your results list and backtrack to the start.
  5. Try matching "filesaveasstring". No unvisited matches. Done.
  6. Select the most likely from [[file save as string] [files ave as string]] using a heuristic or a natural language parser.

Tags:

Python

Split

Word