How can I determine Levenshtein distance for Mandarin Chinese characters?
Firstly, just to clarify: A Chinese character is not, as such, equivalent to a German or English word. Most of the things you'd consider words (using a semantic or syntactic definition of "word") consist of 1–3 characters. It is straightforward to apply Levenshtein distance to such character sequences by representing them as sequences of UCS-2 or UCS-4 code points. As most words are short (especially words of 1 or 2 characters), it may be of limited use, though.
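For that whole-word case, the standard dynamic-programming algorithm needs no adaptation; in Python 3, for instance, a str is already a sequence of code points, so Chinese words can be compared character by character. A minimal sketch (the example words are mine):

```python
def levenshtein(a: str, b: str) -> int:
    # Plain Levenshtein distance over Unicode code points (Wagner-Fischer).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("北京大学", "北京大學"))  # 1 (simplified vs. traditional 学/學)
print(levenshtein("中国", "中间"))          # 1
```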
However, as your question is specifically about the edit distance between individual characters, I believe a different approach is required, and it may be very difficult indeed.
For a start, you'd have to represent each character as a sequence of the components / strokes it consists of. There are two problems:
- Some components themselves consist of even smaller components, so how to break a character down into "atomic" components is not uniquely defined. If you go down to the level of individual strokes, you'd need a characterisation of every single stroke (position within the character, shape, direction etc.). I don't think anyone has ever done this (I'd be most interested if anyone tells me otherwise).
- You'd need to put the strokes or components into an order. The obvious candidate is the canonical stroke order of the character, which is described in lexica, and there are even dictionary websites with animated stroke order diagrams. However, the data sources I know of (for Japanese) generate these animations as sequences of bitmap graphics; I have never seen human- or machine-readable codes that represent the sequence of strokes (or even the names of individual strokes) in a form suitable for edit distance calculation (see the sketch after this list for what such a calculation could look like).
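Just to illustrate what such a calculation could look like if stroke sequences were available in machine-readable form, here is a sketch over a few hand-labelled sequences (the stroke names and the choice of characters are mine; positions, shapes and directions are ignored):

```python
# Hand-entered canonical stroke orders for a few very simple characters,
# purely for illustration -- this is exactly the data that is hard to obtain.
STROKES = {
    "十": ["horizontal", "vertical"],
    "土": ["horizontal", "vertical", "horizontal"],
    "王": ["horizontal", "horizontal", "vertical", "horizontal"],
}

def seq_levenshtein(a, b):
    # Same dynamic programming as for strings, but over sequences of stroke labels.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

print(seq_levenshtein(STROKES["十"], STROKES["土"]))  # 1 (one extra horizontal stroke)
print(seq_levenshtein(STROKES["土"], STROKES["王"]))  # 1
```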
One final thing you could try, though, is to render the character glyphs and calculate the edit distance based on how many pixels (or vectors) need to be changed to turn one character into another. I once did this for Latin characters and character combinations (on a pixel basis) in the context of OCR post-correction, and the results were quite encouraging.
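A rough sketch of that pixel-based idea with Pillow is below. Strictly speaking it counts differing pixels (a Hamming-style measure) rather than a true edit distance, and the font path is an assumption you'd have to adapt to your system (any CJK font will do):

```python
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc"  # assumed path
CANVAS = 80       # image edge length in pixels
FONT_SIZE = 64

def render(char: str) -> Image.Image:
    # Render a single glyph as a binarised grayscale image.
    font = ImageFont.truetype(FONT_PATH, FONT_SIZE)
    img = Image.new("L", (CANVAS, CANVAS), 0)
    ImageDraw.Draw(img).text((4, 4), char, font=font, fill=255)
    return img.point(lambda p: 255 if p > 127 else 0)

def pixel_distance(c1: str, c2: str) -> int:
    # Count the pixels that differ between the two rendered glyphs.
    a, b = render(c1).getdata(), render(c2).getdata()
    return sum(pa != pb for pa, pb in zip(a, b))

# Visually similar characters should come out closer than dissimilar ones.
print(pixel_distance("土", "士"), pixel_distance("土", "馬"))
```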
A quick answer to larsmans' comment below: There are two related concepts defined by the Unicode Standard (below I refer to version 6.0, chapter 12):
An index based on radicals and stroke counts. Each Han character consists of several components, one of which is the radical. A radical/stroke count index is a character list sorted by radical (i.e. all characters that share the same radical grouped together), with each radical-specific group internally sorted by the number of strokes used in the rest of the character. Unfortunately, even this is not uniquely defined – there are characters whose radical is defined differently by different traditional lexica, and stroke counting can also be difficult. Here is what the Unicode Standard says:
To expedite locating specific Han ideographic characters in the code charts, radical-stroke indices are provided on the Unicode web site. [...] The most influential authority for radical-stroke information is the eighteenth-century KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radicals today is that many simplified characters are difficult to classify under any of the 214 KangXi radicals. As a result, various modern radical sets have been introduced. None, however, is in general use, and the 214 KangXi radicals remain the best known. [...] The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Standard follows a number of different sources for radical-stroke classification. Where two sources are at odds as to radical or stroke count for a given character, the character is shown in both positions in the radical-stroke charts.
Note that even if we assume the radical/stroke index to be unambiguous and correct, it wouldn't suffice as a source of information to transform a character into a sequence of components, because the only component of the character fully described by this is the radical.
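If you nevertheless want to squeeze a crude similarity measure out of the radical-stroke data alone, the Unihan database provides the kRSUnicode field (radical number plus residual stroke count) for every Han character. The following is only a sketch: the file name, the parsing details and the ad-hoc scoring are my assumptions, and, as said, it cannot see any component other than the radical:

```python
def load_radical_stroke(path="Unihan_IRGSources.txt"):
    # Parse kRSUnicode entries, which look like "U+4E2D<TAB>kRSUnicode<TAB>2.3".
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            cp, field, value = line.rstrip("\n").split("\t")[:3]
            if field == "kRSUnicode":
                char = chr(int(cp[2:], 16))                  # "U+4E2D" -> 中
                radical, strokes = value.split()[0].split(".")
                table[char] = (radical.rstrip("'"), int(strokes))
    return table

def rs_distance(c1, c2, table):
    r1, s1 = table[c1]
    r2, s2 = table[c2]
    # Same radical: difference in residual stroke count;
    # different radical: an arbitrary penalty (my choice, not a standard).
    return abs(s1 - s2) + (0 if r1 == r2 else 10)
```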
Ideographic description sequences (section 12.2): Unicode defines code points for the basic components of characters (most of them can themselves be used as standalone characters anyway), and there are code points used to glue those together to form a sequence of components that describes the composition of a more complex character. So this works in a way similar to combining characters, but there are important differences:
- The order of components is not uniquely defined
- There is no definition of a rendering mechanism for such sequences
- There is no mapping from ordinary characters to corresponding ideographic description sequences (although the Standard mentions that such mappings, to some extent, exist in the sources they used to compile the Han character set).
The Standard suggests that ideographic description sequences be used to describe complex or rare characters that are not represented by any existing code point; but it explicitly discourages the use of description sequences in place of ordinary characters:
In particular, Ideographic Description Sequences should not be used to provide alternative graphic representations of encoded ideographs in data interchange. Searching, collation, and other content-based text operations would then fail.
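Still, if you compiled (or, for a small character set, hand-wrote) IDS strings, ordinary Levenshtein distance over them would give a first approximation of structural similarity. In this sketch the tiny IDS table is entered by hand, precisely because no standard mapping exists:

```python
# Hand-written ideographic description sequences
# (⿰ = left-to-right composition, ⿱ = above-below composition).
IDS = {
    "好": "⿰女子",
    "妈": "⿰女马",
    "吗": "⿰口马",
    "杏": "⿱木口",
}

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Characters sharing layout and components come out closer:
print(levenshtein(IDS["妈"], IDS["吗"]))  # 1 (女 vs 口)
print(levenshtein(IDS["妈"], IDS["杏"]))  # 3 (different layout and components)
```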
I wrote a Python package `fuzzychinese` to correct misspellings of Chinese words.

As @jogojapan has said, if you really want to calculate Levenshtein distance, it makes more sense to use sub-character structures such as radicals or strokes. You can use the `Stroke()` or `Radical()` classes from `fuzzychinese` to break down characters, and then calculate Levenshtein distance.
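Roughly along these lines, for example (the decomposition call shown as `get_stroke()` is assumed, so check the package documentation for the exact API; `python-Levenshtein` is just one off-the-shelf distance implementation):

```python
from fuzzychinese import Stroke
import Levenshtein  # pip install python-Levenshtein

stroke = Stroke()

def stroke_distance(word1: str, word2: str) -> int:
    # Decompose each word into a stroke string, then compare the stroke strings.
    s1 = "".join(stroke.get_stroke(c) for c in word1)  # assumed method name
    s2 = "".join(stroke.get_stroke(c) for c in word2)
    return Levenshtein.distance(s1, s2)

print(stroke_distance("广东", "广州"))
```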
However, I am not sure Levenshtein distance works well for correcting misspelled Chinese words. In the package I wrote, I calculated tf–idf vectors over stroke n-grams and used cosine similarity to match words.
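For reference, a sketch of how the matcher is used (roughly as in the package's README; exact parameter names may differ between versions):

```python
import pandas as pd
from fuzzychinese import FuzzyChineseMatch

known_words = pd.Series(["长白朝鲜族自治县", "长阳土家族自治县", "城步苗族自治县"])
raw_words = pd.Series(["长白县", "长阳县", "城步县"])  # misspelled / abbreviated inputs

# Stroke 3-grams, tf-idf weighting, cosine similarity under the hood.
fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer="stroke")
fcm.fit(known_words)
top2 = fcm.transform(raw_words, n=2)   # two best matches per raw word
print(top2)
print(fcm.get_similarity_score())
```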