What is the Metaphone 3 Algorithm?

Since the author (Lawrence Philips) decided to commercialize the algorithm, you are unlikely to find a published description of it. A good place to ask is the mailing list: https://lists.sourceforge.net/lists/listinfo/aspell-metaphone

You can also check out the source code (in particular, the code comments) to understand how the algorithm works: http://code.google.com/p/google-refine/source/browse/trunk/main/src/com/google/refine/clustering/binning/Metaphone3.java?r=2029


The link posted by @Bo now points to the entire source code of the (now defunct) project.

Here is a new, direct link to the source code for Metaphone 3: https://searchcode.com/codesearch/view/2366000/

by Lawrence Philips

Metaphone 3 is designed to return an approximate phonetic key (and an alternate approximate phonetic key when appropriate) that should be the same for English words, and most names familiar in the United States, that are pronounced similarly. The key value is not intended to be an exact phonetic, or even phonemic, representation of the word. This is because a certain degree of 'fuzziness' has proven to be useful in compensating for variations in pronunciation, as well as misheard pronunciations. For example, although americans are not usually aware of it, the letter 's' is normally pronounced 'z' at the end of words such as "sounds".

The 'approximate' aspect of the encoding is implemented according to the following rules:

(1) All vowels are encoded to the same value - 'A'. If the parameter encodeVowels is set to false, only initial vowels will be encoded at all. If encodeVowels is set to true, 'A' will be encoded at all places in the word that any vowels are normally pronounced. 'W' as well as 'Y' are treated as vowels. Although there are differences in the pronunciation of 'W' and 'Y' in different circumstances that lead to their being classified as vowels under some circumstances and as consonants in others, for the purposes of the 'fuzziness' component of the Soundex and Metaphone family of algorithms they will be always be treated here as vowels.

(2) Voiced and un-voiced consonant pairs are mapped to the same encoded value. This means that:
* 'D' and 'T' -> 'T'
* 'B' and 'P' -> 'P'
* 'G' and 'K' -> 'K'
* 'Z' and 'S' -> 'S'
* 'V' and 'F' -> 'F'

In addition to the above voiced/unvoiced rules, 'CH' and 'SH' -> 'X', where 'X' represents the "-SH-" and "-CH-" sounds in Metaphone 3 encoding.
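The 'fuzziness' rules quoted above can be illustrated with a small sketch. This is only a toy letter-by-letter encoder, assuming nothing beyond the rules as quoted; it is not Metaphone 3, and real Metaphone 3 keys will differ because the actual algorithm applies many context-sensitive pronunciation rules. The names `fuzzy_key`, `VOWELS`, and `VOICED_TO_UNVOICED` are made up for this example, and collapsing a run of vowels into a single 'A' is a simplification of my own.

```python
# Toy illustration of the quoted fuzziness rules only; NOT the Metaphone 3 algorithm.
VOWELS = set("AEIOUWY")  # the quoted comment treats 'W' and 'Y' as vowels

# Rule (2): each voiced consonant is folded onto its unvoiced partner.
VOICED_TO_UNVOICED = {"D": "T", "B": "P", "G": "K", "Z": "S", "V": "F"}


def fuzzy_key(word: str, encode_vowels: bool = False) -> str:
    """Encode a word using only the quoted rules (hypothetical helper)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    out = []
    i = 0
    while i < len(letters):
        c = letters[i]
        pair = c + (letters[i + 1] if i + 1 < len(letters) else "")
        if pair in ("CH", "SH"):
            code = "X"  # 'CH' and 'SH' -> 'X'
            i += 2
        elif c in VOWELS:
            # Rule (1): every vowel becomes 'A'; without encode_vowels,
            # only a vowel in the first position is kept.
            code = "A" if (encode_vowels or i == 0) else ""
            i += 1
        else:
            code = VOICED_TO_UNVOICED.get(c, c)
            i += 1
        # Simplifying assumption: a run of vowels yields a single 'A'.
        if code == "A" and out and out[-1] == "A":
            continue
        if code:
            out.append(code)
    return "".join(out)
```

With this sketch, `fuzzy_key("bad")` and `fuzzy_key("pat")` produce the same key, which is the kind of deliberate imprecision the quoted comment describes.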


From Wikipedia, the Metaphone algorithm is

Metaphone is a phonetic algorithm, an algorithm published in 1990 for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar [...]

Metaphone 3 specifically

[...] achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States, having been developed according to modern engineering standards against a test harness of prepared correct encodings.

The overview of the algorithm is:

The Metaphone algorithm operates by first removing non-English letters and characters from the word being processed. Next, all vowels are also discarded unless the word begins with an initial vowel, in which case all vowels except the initial one are discarded. Finally, all consonants and groups of consonants are mapped to their Metaphone code. The rules for grouping consonants and groups thereof, then mapping them to Metaphone codes, are fairly complicated; for a full list of these conversions check out the comments in the source code section.
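The first two steps of that overview are simple enough to sketch directly. The snippet below is a minimal illustration that assumes "non-English letters and characters" means anything outside A-Z; it deliberately leaves out the third step (the consonant mapping), which is where the real complexity lives. The function name `strip_and_drop_vowels` is hypothetical.

```python
import re


def strip_and_drop_vowels(word: str) -> str:
    """Steps 1 and 2 of the quoted overview; consonant mapping is omitted."""
    # Step 1: keep only the letters A-Z (assumption about "non-English").
    letters = re.sub(r"[^A-Z]", "", word.upper())
    if not letters:
        return ""
    head, tail = letters[0], letters[1:]
    # Step 2: drop every vowel after the first character, so an
    # initial vowel survives and all later vowels are discarded.
    return head + re.sub(r"[AEIOU]", "", tail)
```

The result is not a Metaphone code yet, only the consonant skeleton that the mapping rules would then operate on.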

Now, on to your real question:

If you are interested in the specifics of the Metaphone 3 algorithm, I think you are out of luck, short of buying the source code, understanding it and re-creating it on your own. The whole point of not making the algorithm public (the source you can buy is one instance of it) is that you cannot recreate it without paying the author for their development effort; providing the "precise algorithm" you are looking for would be equivalent to providing the actual code itself. Consider the above quotes: the development of the algorithm involved a "test harness of [...] encodings". Unless you happen to have such a test harness or are able to create one, you will not be able to replicate the algorithm.

On the other hand, implementations of the first two iterations (Metaphone and Double Metaphone) are freely available (the above Wikipedia link contains a score of links to implementations in various languages for both). That gives you a good starting point for understanding exactly what the algorithm is about, and you can then improve on it as you see fit (e.g. by creating and using an appropriate test harness).
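To make the test-harness idea concrete, here is a minimal sketch of what such a harness could look like: it runs any encoder over a table of prepared expected keys and reports the accuracy and the mismatches. The function `run_harness` and the shape of the `expected` table are my own assumptions, not anything taken from the Metaphone 3 distribution, and you would have to supply the prepared encodings yourself.

```python
from typing import Callable, Dict, List, Tuple

# (word, expected key, key actually produced)
Failure = Tuple[str, str, str]


def run_harness(
    encode: Callable[[str], str],
    expected: Dict[str, str],
) -> Tuple[float, List[Failure]]:
    """Score an encoder against prepared encodings (hypothetical helper)."""
    failures: List[Failure] = []
    for word, want in expected.items():
        got = encode(word)
        if got != want:
            failures.append((word, want, got))
    total = len(expected)
    # An empty table counts as fully accurate rather than dividing by zero.
    accuracy = (total - len(failures)) / total if total else 1.0
    return accuracy, failures
```

You would pass in a freely available Metaphone or Double Metaphone implementation as `encode`, inspect the returned failures, adjust the rules, and repeat.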