Split a sentence into separate words
You might want to consider using a trie data structure. First construct the trie from the dictionary; searching for valid words is then much faster. The advantage is that determining whether you are at the end of a word, or need to keep looking for a longer one, is very fast.
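For illustration, here is a minimal trie sketch along those lines (the `Trie`/`TrieNode` names and the tiny word list are made up for the example; a real dictionary would be loaded from a file):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def longest_match(self, text, start=0):
        """Return the longest dictionary word beginning at text[start], or None."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break            # no dictionary word continues with this character
            if node.is_word:
                best = text[start:i + 1]
        return best

trie = Trie(["他", "他们", "他们的"])
print(trie.longest_match("他们的书"))  # -> 他们的
```

Because the walk stops as soon as no child matches, each lookup costs at most the length of the longest dictionary word, regardless of dictionary size.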
I do realize that the Chinese word segmentation problem is a very complex one, but in some cases this trivial algorithm may be sufficient: search for the longest word w starting at the i-th character, then start again at the (i + length(w))-th character.
Here's a Python implementation:
#!/usr/bin/env python3
# encoding: utf-8
import re
import unicodedata

class ChineseDict:
    def __init__(self, lines, rex):
        # Keep only the word extracted by the regex; skip comment lines.
        self.words = set(rex.match(line).group(1) for line in lines if not line.startswith("#"))
        self.maxWordLength = max(map(len, self.words))

    def segmentation(self, text):
        result = []
        previousIsSticky = False
        i = 0
        while i < len(text):
            # Greedy longest match: try the longest candidate first.
            for j in range(i + self.maxWordLength, i, -1):
                s = text[i:j]
                if s in self.words:
                    break
            # Single characters that are not letters (punctuation, digits, ...)
            # "stick" to the neighbouring segment.
            sticky = len(s) == 1 and unicodedata.category(s) != "Lo"
            if previousIsSticky or (result and sticky):
                result[-1] += s
            else:
                result.append(s)
            previousIsSticky = sticky
            i = j
        return " | ".join(result)

    def genWords(self, text):
        # Same greedy scan, but yield only actual dictionary words.
        i = 0
        while i < len(text):
            for j in range(i + self.maxWordLength, i, -1):
                s = text[i:j]
                if s in self.words:
                    yield s
                    break
            i = j

if __name__ == "__main__":
    with open("cedict_ts.u8", encoding="utf-8") as f:
        cedict = ChineseDict(f, re.compile(r"^.+? (.+?) .+"))
    text = """33. 你可以叫我夏尔
戴高乐将军和夫人在科隆贝双教堂村过周末。星期日早晨,伊冯娜无意中走进浴室,正巧将军在洗盆浴。她感到非常意外,不禁大叫一声:“我的上帝!”
戴高乐于是转过身,看见妻子因惊魂未定而站立在门口。他继续用香皂擦身,不紧不慢地说:“伊冯娜,你知道,如果是我们之间的隐私,你可以叫我夏尔,用不着叫我上帝……”
"""
    print(cedict.segmentation(text))
    print(" | ".join(cedict.genWords(text)))
The last part uses a copy of the CC-CEDICT dictionary to segment a (simplified) Chinese text in two flavours (with and without non-word characters, respectively):
33. 你 | 可以 | 叫 | 我 | 夏 | 尔
戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末。星期日 | 早晨,伊 | 冯 | 娜 | 无意中 | 走进 | 浴室,正巧 | 将军 | 在 | 洗 | 盆浴。她 | 感到 | 非常 | 意外,不禁 | 大 | 叫 | 一声:“我的 | 上帝!”
戴高乐 | 于是 | 转 | 过 | 身,看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口。他 | 继续 | 用 | 香皂 | 擦 | 身,不 | 紧 | 不 | 慢 | 地 | 说:“伊 | 冯 | 娜,你 | 知道,如果 | 是 | 我们 | 之间 | 的 | 隐私,你 | 可以 | 叫 | 我 | 夏 | 尔,用不着 | 叫 | 我 | 上帝……”
你 | 可以 | 叫 | 我 | 夏 | 尔 | 戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末 | 星期日 | 早晨 | 伊 | 冯 | 娜 | 无意中 | 走进 | 浴室 | 正巧 | 将军 | 在 | 洗 | 盆浴 | 她 | 感到 | 非常 | 意外 | 不禁 | 大 | 叫 | 一声 | 我的 | 上帝 | 戴高乐 | 于是 | 转 | 过 | 身 | 看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口 | 他 | 继续 | 用 | 香皂 | 擦 | 身 | 不 | 紧 | 不 | 慢 | 地 | 说 | 伊 | 冯 | 娜 | 你 | 知道 | 如果 | 是 | 我们 | 之间 | 的 | 隐私 | 你 | 可以 | 叫 | 我 | 夏 | 尔 | 用不着 | 叫 | 我 | 上帝
You have the input text: a sentence, a paragraph, whatever. So yes, processing it will need to query your database for each check.
With decent indexing on the word column, though, you shouldn't have too many problems.
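As a sketch of that indexed-lookup idea, using an in-memory SQLite database (the `words` table, its column name, and the sample entries are assumptions for the example):

```python
import sqlite3

# Illustrative schema: one table of words with a unique index on the word column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_word ON words (word)")
conn.executemany("INSERT INTO words VALUES (?)", [("你好",), ("世界",)])

def is_word(s):
    # The unique index turns this lookup into a fast B-tree probe
    # instead of a full table scan.
    cur = conn.execute("SELECT 1 FROM words WHERE word = ?", (s,))
    return cur.fetchone() is not None

print(is_word("你好"))  # True
print(is_word("好世"))  # False
```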
Having said that, how big is this dictionary? After all, you only need the words, not their definitions, to check whether a string is a valid word. So if at all possible (depending on the size), holding a huge memory map/hashtable/dictionary with just the keys (the actual words) may be an option, and it would be lightning fast.
At 15 million words, say an average of 7 characters at 2 bytes each, you're around the 200-megabyte mark. Not too crazy.
Edit: At 'only' 1 million words, you're looking at just over 13 megabytes, say 15 with some overhead. That's a no-brainer, I would say.
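A minimal sketch of that keys-only in-memory approach (the word-list file format, one word per line, is an assumption), together with the arithmetic behind the size estimate:

```python
def load_words(path):
    """Load just the word keys into a hash set; membership tests are then O(1)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Back-of-the-envelope size estimate for the raw character data:
# 1 million words x ~7 characters x 2 bytes each.
print(1_000_000 * 7 * 2 / 1024 / 1024)  # ≈ 13.35 (megabytes)
```

In practice the Python set itself adds per-entry overhead on top of the raw character data, which is why rounding the estimate up (say, to 15 MB) is sensible.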
Thanks to everyone for your help!
After a little research I've found some working tools (keeping all your suggestions in mind), which is why I'm answering my own question.
A PHP class (http://www.phpclasses.org/browse/package/2431.html)
A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)
A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)
There are some other solutions available if you try searching baidu.com for "中文分词".
Sincerely,
Equ