Parsing natural language ingredient quantities for recipes

You pose two problems: recognizing and extracting the quantity expressions (syntax), and figuring out what amount they mean (semantics).

Before you figure out whether regexps are enough to recognize the quantities, you should make yourself a good schema (grammar) of what they look like. Your examples look like this:

<amount> <unit> [of <ingredient>]

where <amount> can take many forms:

whole or decimal number, in digits (250, 0.75)
common fraction (3/4)
numeral in words (half, one, ten, twenty-five, three quarters)
determiner instead of a numeral ("an onion")
subjective (some, a few, several)

The amount can also be expressed as a range of two simple <amount>s:

two to three
2 to 3
2-3
five to 10
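
As a rough illustration of the amount forms and ranges listed above, here is a minimal Ruby sketch. The names parse_simple_amount, parse_amount, WORD_NUMBERS and RANGE are made up for this example, and the word table is deliberately tiny (no "twenty-five" or "three quarters"), so treat it as a starting point rather than a complete solution. Subjective amounts such as "some" simply come back as nil here.

```ruby
# Hypothetical word table; extend it as needed.
WORD_NUMBERS = {
  "a" => 1, "an" => 1, "one" => 1, "two" => 2, "three" => 3, "four" => 4,
  "five" => 5, "six" => 6, "seven" => 7, "eight" => 8, "nine" => 9,
  "ten" => 10, "half" => Rational(1, 2)
}.freeze

# "<x> to <y>" for any simple amounts, or "<x>-<y>" for digits only,
# so that hyphenated words like "twenty-five" are not mistaken for ranges.
RANGE = %r{\A(?:(.+?)\s+to\s+(.+)|([\d./]+)\s*-\s*([\d./]+))\z}

# Returns a Rational, or nil if the text is not a recognized simple amount.
def parse_simple_amount(text)
  t = text.strip.downcase
  case t
  when %r{\A(\d+)/(\d+)\z} then Rational($1.to_i, $2.to_i) # common fraction
  when /\A\d+(?:\.\d+)?\z/ then Rational(t)                # whole or decimal
  else WORD_NUMBERS[t]&.to_r                               # numeral or determiner
  end
end

# Returns a Rational, a Range of Rationals, or nil.
def parse_amount(text)
  t = text.strip.downcase
  if (m = RANGE.match(t))
    low  = parse_simple_amount(m[1] || m[3])
    high = parse_simple_amount(m[2] || m[4])
    return (low..high) if low && high
  end
  parse_simple_amount(t)
end
```

With this, parse_amount("five to 10") gives a range from 5 to 10, and parse_amount("3/4") gives the exact fraction 3/4 rather than a float, which keeps later scaling free of rounding noise.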

Then you have the units themselves:

general-purpose measurements (lb, oz, kg, g; pounds, ounces, etc.)
cooking units (Tb, tsp)
informal units (a pinch, a dash)
container sizes (package, bunch, large can)
no unit at all, for countable ingredients (as in "three lemons")

Finally, there's a special case of expressions that can never be combined with either amounts or units, so they effectively function as a combination of both:

a little
to taste

I'd suggest approaching this as a small parser, which you can make as detailed or as rough as you need. It shouldn't be too hard to write regexps for all of those forms, if that's your tool of choice, but as you can see it's not just a question of textual substitution. Pull the parts out and represent each ingredient as a triple (amount, unit, ingredient). (For countables, use a special unit such as "pieces"; I'd treat "a little" and the like as special units too.)
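
To make the triple idea concrete, here is a rough regexp-based sketch in Ruby. Everything in it (the Ingredient struct, the UNITS and SPECIAL lists, the AMOUNT and LINE patterns, and parse_ingredient) is invented for illustration, and the lists are far from complete. It keeps the amount as raw text; interpreting it is a separate step.

```ruby
Ingredient = Struct.new(:amount, :unit, :name)

# Hypothetical, incomplete lists; extend them for real recipes.
UNITS   = %w[lb lbs pound pounds oz ounce ounces kg g gram grams
             cup cups tbsp tsp pinch dash package bunch can cans].freeze
SPECIAL = ["a little", "to taste"].freeze

# Digits (with optional decimal, fraction, or range) or a few amount words.
AMOUNT = %r{\d+(?:\.\d+)?(?:/\d+)?(?:\s*(?:-|to)\s*\d+(?:\.\d+)?)?|a few|several|some|half|one|two|three|an?}

# <amount> [<unit>] [of] <ingredient>
LINE = /\A(?<amount>#{AMOUNT})\s+(?:(?<unit>#{Regexp.union(UNITS)})\b\.?\s+)?(?:of\s+)?(?<name>.+)\z/

# Returns an Ingredient triple, or nil if the line does not fit the schema.
def parse_ingredient(line)
  text = line.strip.downcase

  # Expressions like "to taste" stand in for both amount and unit.
  SPECIAL.each do |phrase|
    next unless text.include?(phrase)
    name = text.sub(phrase, "").sub(/\A\s*of\s+/, "").strip
    return Ingredient.new(nil, phrase, name)
  end

  m = LINE.match(text)
  return nil unless m

  # Countables without a unit get the special unit "pieces".
  Ingredient.new(m[:amount], m[:unit] || "pieces", m[:name])
end
```

For example, parse_ingredient("3/4 cup of sugar") yields the triple ("3/4", "cup", "sugar"), while parse_ingredient("three lemons") falls back to the unit "pieces". Multi-word units like "large can" are not handled by this sketch.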

That leaves the question of converting or comparing the quantities. Unit conversion has been done in lots of places, so at least for the official units you should have no trouble getting the conversion tables. Google will do it if you type "convert 4oz to grams", for example. Note that a Tbsp is either three or four tsp, depending on the country (three in most places, four in Australia).
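
A conversion table for the well-defined weight units can be as simple as a hash keyed on a base unit. The sketch below is only an illustration: GRAMS_PER_UNIT, convert_weight and tbsp_to_tsp are names I made up, and the tsp_per_tbsp parameter is one possible way to handle the country difference.

```ruby
# Grams per unit, using the standard avoirdupois definitions.
GRAMS_PER_UNIT = {
  "g"  => 1.0,
  "kg" => 1000.0,
  "oz" => 28.349523125,
  "lb" => 453.59237
}.freeze

# Converts between any two units in the table; raises KeyError on unknown units.
def convert_weight(amount, from, to)
  amount * GRAMS_PER_UNIT.fetch(from) / GRAMS_PER_UNIT.fetch(to)
end

# The tablespoon size is country-dependent, so it is passed in explicitly.
def tbsp_to_tsp(amount, tsp_per_tbsp: 3)
  amount * tsp_per_tbsp
end
```

So convert_weight(4, "oz", "g") comes out at roughly 113.4, and tbsp_to_tsp(2, tsp_per_tbsp: 4) covers the four-teaspoon tablespoon. Converting between volume and weight would need per-ingredient densities, which this does not attempt.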

You can standardize to your favorite units pretty easily for well-defined units, but the informal units are a little trickier. For "a pinch", "a dash", and the like, I would suggest finding out the approximate weight so that you can scale properly (ten pinches = 2 grams, or whatever). Cans and the like are hopeless, unless you can look up the size of particular products.

On the other hand, subjective amounts are the easiest: If you scale up "to taste" ten times, it's still "to taste"!
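
Putting the last two paragraphs together, scaling might look something like the sketch below. The Quantity struct, the APPROX_GRAMS and SUBJECTIVE tables and the scale method are all hypothetical, and the pinch weight is just the placeholder figure from above (ten pinches = 2 grams), not a measured value.

```ruby
Quantity = Struct.new(:amount, :unit)

# Placeholder weight for informal units, taken from the rough
# "ten pinches = 2 grams" guess; replace with real figures.
APPROX_GRAMS = { "pinch" => Rational(1, 5) }.freeze

# Units that stay the same no matter how much the recipe is scaled.
SUBJECTIVE = ["to taste", "a little", "some"].freeze

def scale(quantity, factor)
  # Subjective amounts are returned unchanged.
  return quantity if quantity.amount.nil? || SUBJECTIVE.include?(quantity.unit)

  if (grams = APPROX_GRAMS[quantity.unit])
    # Informal units are converted to an approximate weight, then scaled.
    Quantity.new(quantity.amount * factor * grams, "g")
  else
    Quantity.new(quantity.amount * factor, quantity.unit)
  end
end
```

Scaling one pinch by ten gives 2 g, two cups by three gives six cups, and "to taste" comes back exactly as it went in.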

One last thought: Some sort of database of ingredients is also needed for recognizing the main ingredients, since size matters: "One egg" is probably not the major ingredient, but "one small goat, quartered" may well be. I would consider it for version 2.


Regular expressions are difficult to get right for natural language parsing. NLTK, as you mentioned, would probably be a good option to look into; otherwise you'll find yourself going around in circles trying to get the expressions right.

If you want something of the Ruby variety instead of NLTK, take a look at Treat:

https://github.com/louismullie/treat

Also, the Linguistics framework might be a good option as well:

http://deveiate.org/projects/Linguistics

EDIT:

I figured there had to be a Ruby recipe parser out there already. Here's another option you might want to look into:

https://github.com/iancanderson/ingreedy

Tags:

Ruby

Regex

Nlp