Python - How to intuit word from abbreviated text using NLP?
If you cannot find an exhaustive dictionary, you could build (or download) a probabilistic language model, to generate and evaluate sentence candidates for you. It could be a character n-gram model or a neural network.
For your abbreviations, you can build a "noise model" that predicts the probability of character omissions. It can learn from a corpus (which you would have to label manually or semi-manually) that, for example, consonants are omitted less frequently than vowels.
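As a rough illustration of such a noise model, here is a minimal sketch that estimates a per-character omission probability from labeled (full word, abbreviation) pairs. The function name `omission_probs`, the toy pairs, and the greedy left-to-right alignment are all assumptions made for this example; a real model would want more data and some smoothing.

```python
from collections import Counter

def omission_probs(pairs):
    """Estimate P(character is dropped) from (full, abbreviation) pairs.

    Assumes every abbreviation is a subsequence of its full form and
    aligns the two greedily from left to right.
    """
    seen = Counter()
    dropped = Counter()
    for full, abbr in pairs:
        j = 0  # position of the next unmatched character in abbr
        for ch in full:
            seen[ch] += 1
            if j < len(abbr) and abbr[j] == ch:
                j += 1            # character was kept
            else:
                dropped[ch] += 1  # character was omitted
        if j != len(abbr):
            raise ValueError(f"{abbr!r} is not a subsequence of {full!r}")
    return {ch: dropped[ch] / seen[ch] for ch in seen}

# Hypothetical hand-labeled training pairs.
pairs = [("water", "wtr"), ("bottle", "btl"), ("basketball", "bsktball")]
probs = omission_probs(pairs)
```

On these toy pairs the vowels end up with much higher omission probabilities than most consonants, which is the pattern the noise model is meant to capture.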
With a complex language model and a simple noise model, you can combine them using the noisy channel approach (see e.g. the article by Jurafsky for more details) to suggest candidate sentences.
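To make the combination concrete, here is a small sketch of noisy channel scoring: the score of a candidate is -log P(candidate) - log P(abbreviation | candidate), so lower is better. Everything specific here is an assumption for illustration: the word counts stand in for a real language model, the drop probabilities (0.6 for vowels, 0.1 for other characters) are made up, and the function names are hypothetical.

```python
import math

VOWELS = set("aeiou")

def drop_prob(ch):
    # Assumed noise model: vowels are dropped far more often than consonants.
    return 0.6 if ch in VOWELS else 0.1

def channel_prob(abbr, cand):
    """P(abbr | cand) when each character of cand is dropped independently.

    Dynamic programming over all ways of aligning abbr with cand:
    f[j] is the probability of having produced abbr[:j] so far.
    """
    f = [1.0] + [0.0] * len(abbr)
    for ch in cand:
        p = drop_prob(ch)
        new = [0.0] * (len(abbr) + 1)
        for j in range(len(abbr) + 1):
            new[j] = f[j] * p                    # ch was dropped
            if j > 0 and abbr[j - 1] == ch:
                new[j] += f[j - 1] * (1.0 - p)   # ch was kept
        f = new
    return f[len(abbr)]

def noisy_channel_score(abbr, cand, counts):
    """Negative log of prior times likelihood; lower means more plausible."""
    prior = counts[cand] / sum(counts.values())
    likelihood = channel_prob(abbr, cand)
    if likelihood == 0.0:
        return math.inf  # cand cannot produce abbr by deletions alone
    return -math.log(prior) - math.log(likelihood)

# Toy unigram counts standing in for a trained language model.
counts = {"water": 50, "waiter": 20, "winter": 30, "bottle": 40}
scores = {w: noisy_channel_score("wtr", w, counts) for w in counts}
best = min(scores, key=scores.get)
```

A full solution would replace the word list with candidates generated by the language model itself, but the scoring rule stays the same.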
Update. I got enthusiastic about this problem and implemented this algorithm:
- language model (character 5-gram trained on the Lord of the Rings text)
- noise model (probability of each symbol being abbreviated)
- beam search algorithm, for candidate phrase suggestion.
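For readers who want a feel for the first component, a character n-gram model can be sketched in a few lines. This is not the code from the notebook; the class name `CharNgramModel`, the add-one smoothing, the space padding, and the tiny training string are assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict

class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, order=5):
        self.order = order
        self.counts = defaultdict(Counter)  # context -> next-character counts
        self.alphabet = set()

    def fit(self, text):
        padded = " " * (self.order - 1) + text
        self.alphabet.update(text)
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            self.counts[context][padded[i]] += 1
        return self

    def neg_log_prob(self, text):
        """Sum of -log P(char | previous order-1 chars); lower is more likely."""
        padded = " " * (self.order - 1) + text
        vocab = len(self.alphabet) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            seen = self.counts.get(context, Counter())
            p = (seen[padded[i]] + 1) / (sum(seen.values()) + vocab)
            total -= math.log(p)
        return total

# Tiny stand-in corpus; a real model needs far more text.
model = CharNgramModel(order=3).fit("the water bottle is in the basket " * 20)
plausible = model.neg_log_prob("water")
implausible = model.neg_log_prob("wxtqr")
```

The beam search then extends partial candidates character by character and keeps only the few with the lowest combined language-model and noise-model score.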
My solution is implemented in this Python notebook. With trained models, it has an interface like noisy_channel('bsktball', language_model, error_model), which, by the way, returns
{'basket ball': 33.5, 'basket bally': 36.0}
The dictionary values are the scores of the suggestions (the lower, the better).
With other examples it works worse: for 'wtrbtl' it returns
{'water but all': 23.7,
'water but ill': 24.5,
'water but lay': 24.8,
'water but let': 26.0,
'water but lie': 25.9,
'water but look': 26.6}
For 'bwlingbl' it gives
{'bwling belia': 32.3,
'bwling bell': 33.6,
'bwling below': 32.1,
'bwling belt': 32.5,
'bwling black': 31.4,
'bwling bling': 32.9,
'bwling blow': 32.7,
'bwling blue': 30.7}
However, when trained on an appropriate corpus (e.g. sports magazines and blogs, maybe with oversampling of nouns), and perhaps with a more generous beam width, this model should provide more relevant suggestions.
So I've looked at a similar problem, and came across a fantastic package called PyEnchant. If you use the built-in spell checker you can get word suggestions, which would be a nice and simple solution. However, it will only suggest single words (as far as I can tell), so the situation you have:
wtrbtl = water bottle
will not work.
Here is some code:
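import enchant

# Load the US English dictionary provided by the Enchant backend.
wordDict = enchant.Dict("en_US")
inputWords = ['wtrbtl', 'bwlingbl', 'bsktball']
for word in inputWords:
    # suggest() returns a list of single-word spelling suggestions.
    print(wordDict.suggest(word))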
The output is:
['rebuttal', 'tribute']
['bowling', 'blinding', 'blinking', 'bumbling', 'alienable', 'Nibelung']
['basketball', 'fastball', 'spitball', 'softball', 'executable', 'basketry']
Perhaps, if you know what sort of abbreviations to expect, you can split the string into two words first, e.g.
'wtrbtl' -> ['wtr', 'btl']
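One way to try this, sketched under assumptions: enumerate every two-way split and keep the splits where both halves can be expanded. To stay self-contained, the sketch uses a small hypothetical word list and a plain subsequence test rather than PyEnchant; in practice a call to the spell checker's suggest method could replace the inner lookup.

```python
def split_candidates(abbr, min_len=2):
    """All ways to cut abbr into two parts of at least min_len characters."""
    return [(abbr[:i], abbr[i:]) for i in range(min_len, len(abbr) - min_len + 1)]

def is_subsequence(abbr, word):
    """True if the characters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)

# Hypothetical vocabulary; a real one would come from a dictionary or corpus.
WORDS = ["water", "bottle", "bowling", "ball", "basket", "battle"]

def expand(abbr):
    """Two-word phrases whose words could abbreviate to the two halves."""
    results = []
    for left, right in split_candidates(abbr):
        for left_word in WORDS:
            if not is_subsequence(left, left_word):
                continue
            for right_word in WORDS:
                if is_subsequence(right, right_word):
                    results.append(left_word + " " + right_word)
    return results

phrases = expand("wtrbtl")
```

With this word list, only the 'wtr' + 'btl' split yields anything, and it still leaves more than one candidate to rank.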
There's also the Natural Language Toolkit (NLTK), which is AMAZING; you could use it in combination with the above code, for example by looking at how common each suggested word is.
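A minimal sketch of that frequency idea, using a plain Counter over a made-up sentence so it runs without extra downloads. The helper name `rank_by_frequency` and the toy corpus are assumptions; a frequency distribution built from one of NLTK's corpora could presumably take the place of the Counter.

```python
from collections import Counter

def rank_by_frequency(suggestions, freq):
    """Order suggestions from most to least frequent in the reference corpus."""
    return sorted(suggestions, key=lambda w: freq[w.lower()], reverse=True)

# Toy reference text standing in for a real corpus.
corpus = ("the bowling ball rolled past the basketball court "
          "and the bowling alley").split()
freq = Counter(corpus)

ranked = rank_by_frequency(['bowling', 'blinding', 'blinking'], freq)
```

Words missing from the corpus simply count as zero, so they sink to the end of the list while keeping their original order.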
Good luck!