Python - pyparsing unicode characters

As a general rule, do not process encoded bytestrings: make them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing always on unicode strings, then, if you have to for I/O purposes, .encode them back into whatever bytestring encoding you require.

If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).

I Was searching about french unicode chars and fall on this question. If you search french or other latin accents, with pyparsing 2.3.0 you can use:

>>> pp.pyparsing_unicode.Latin1.alphas
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'

Pyparsing's printables only deals with strings in the ASCII range of characters. You want printables in the full Unicode range, like this:

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode) 
                                        if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

I was unable to test against your Hindi test string, but I think this will do the trick.

(If you are using Python 3, then there is no separate unichr function, and no xrange generator, just use:

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode) 
                                        if not chr(c).isspace())

EDIT:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.

import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)

Python - pyparsing unicode characters

Tags:

Python

Unicode

Nlp

Pyparsing

Related

Recent Posts