Python - pyparsing unicode characters
As a general rule, do not process encoded bytestrings: turn them into proper unicode strings (by calling their .decode
method) as soon as possible, do all of your processing on unicode strings, and then, if I/O requires it, .encode
them back into whatever bytestring encoding you need.
If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...'
to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).
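A minimal sketch of that decode early, encode late pattern, written for Python 3 (where plain string literals are already Unicode). The sample text and the choice of UTF-8 are assumptions made for illustration only:

```python
# Bytes as they might arrive from a file or socket (assumed to be UTF-8 here).
raw = "café नमस्ते".encode("utf-8")

# Decode once, immediately after input.
text = raw.decode("utf-8")

# Do all processing on the decoded text, never on the raw bytes.
words = text.upper().split()

# Encode once, immediately before output.
out = " ".join(words).encode("utf-8")
```

Working on the decoded text means operations such as upper() and split() see whole characters rather than individual bytes of a multi-byte sequence.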
I was searching for French Unicode characters and came across this question. If you need French or other accented Latin characters, with pyparsing 2.3.0 or later
you can use:
>>> import pyparsing as pp
>>> pp.pyparsing_unicode.Latin1.alphas
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
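As a rough sketch of how that character set could be used, the snippet below builds a Word from Latin1.alphas and parses a short French phrase. The sample phrase and the variable names are made up for illustration, and pyparsing 2.3.0 or later is assumed to be installed:

```python
import pyparsing as pp

# A word made only of Latin-1 letters, accented ones included.
french_word = pp.Word(pp.pyparsing_unicode.Latin1.alphas)

# Hypothetical sample input; whitespace between words is skipped by default.
phrase = "élève très zélé"
french_tokens = pp.OneOrMore(french_word).parseString(phrase).asList()
```

With the plain ASCII pp.alphas, the parse would be expected to fail at the first accented letter instead.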
Pyparsing's printables contains only characters in the ASCII range. You want printables in the full Unicode range, like this:
import sys
unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                             if not unichr(c).isspace())
Now you can define trans using this more complete set of non-space characters:
trans = Word(unicodePrintables)
I was unable to test against your Hindi test string, but I think this will do the trick.
If you are using Python 3, there is no separate unichr function and no xrange generator; just use:
unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                            if not chr(c).isspace())
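The Python 3 variant can be sanity-checked without pyparsing at all. The sketch below rebuilds the same string and confirms that it covers a Devanagari word while excluding whitespace. The sample word is hypothetical (it is not the asker's test string), and the snake_case names are introduced here only for illustration:

```python
import sys

# Same expression as above: every code point that is not whitespace.
unicode_printables = ''.join(
    chr(c) for c in range(sys.maxunicode) if not chr(c).isspace()
)

# A set makes the membership checks below cheap.
printable_set = set(unicode_printables)

sample = "नमस्ते"  # hypothetical Hindi sample
covers_sample = all(ch in printable_set for ch in sample)
contains_space = " " in printable_set
```

Because only whitespace is filtered out, the result also contains control characters and unassigned code points, so it is broader than "printable" in the strict sense.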
EDIT:
With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.
import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
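A short sketch of the Devanagari namespace in use, assuming pyparsing 2.3.0 or later. The Hindi sample is made up for illustration and is not the string from the question:

```python
import pyparsing as pp

# Hypothetical sample text, not the asker's original test string.
sample = "नमस्ते दुनिया"

# printables covers the non-space characters of the Devanagari ranges,
# including the vowel signs and virama that appear inside words.
hindi_word = pp.Word(pp.pyparsing_unicode.Devanagari.printables)
hindi_tokens = pp.OneOrMore(hindi_word).parseString(sample).asList()
```

Restricting the Word to one script's range, rather than to every non-space code point, keeps the grammar from silently accepting text in other scripts.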