shlex.split still not supporting unicode?

The shlex.split() code wraps both unicode() and str() instances in a StringIO() object, which can only handle Latin-1 bytes (so not the full unicode codepoint range).

You'll have to encode the string first (UTF-8 should work) if you still want to use shlex.split(). When the maintainers of the module said that unicode() objects are supported now, they meant only that such objects are accepted, not that anything outside the Latin-1 range of codepoints will work.

Encoding, splitting, then decoding each part gives me:

>>> import shlex
>>> # command_full is the unicode command string from the question
>>> map(lambda s: s.decode('UTF8'), shlex.split(command_full.encode('utf8')))
[u'software.py', u'-fileA=sequence.fasta', u'-fileB=\u65b0\u5efa\u6587\u672c\u6587\u6863.fasta.txt', u'-output_dir=...', u'-FORMtitle=tst']
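
If you need this in more than one place, the same encode/split/decode round trip can be wrapped in a small helper. This is only a sketch: the name split_unicode and its encoding parameter are my own invention, not part of shlex, and it assumes the input is a text (unicode) string. The Python 3 branch relies on shlex.split() accepting text strings directly there, so no round trip is needed.

```python
import shlex
import sys


def split_unicode(command, encoding='utf-8'):
    """Split a text command line into a list of text arguments.

    Hypothetical helper, not part of the shlex module.
    """
    if sys.version_info[0] >= 3:
        # Python 3: shlex works on text strings, nothing to work around.
        return shlex.split(command)
    # Python 2: hand shlex a byte string, then decode each part again.
    return [part.decode(encoding)
            for part in shlex.split(command.encode(encoding))]


parts = split_unicode(u'software.py -fileB=\u65b0\u5efa.fasta.txt "-FORMtitle=a b"')
# parts holds three items: the script name, the -fileB argument with its
# non-Latin-1 characters intact, and the quoted argument without its quotes.
```

UTF-8 is a safe default here because it can encode every codepoint and never produces bytes that collide with the ASCII quote, backslash, or whitespace characters shlex treats specially.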

A now-closed Python issue tried to address this, but the module is very byte-stream oriented, and no new patch has materialized. For now, encoding to ISO-8859-1 or UTF-8 before splitting is the best I can come up with for you.

Actually, there's been a patch for over five years. Last year I got tired of copying a ushlex module around into every project and put it on PyPI:

https://pypi.python.org/pypi/ushlex/

