Loading special characters with PyYaml
Update
The latest version of PyYAML has fixed this bug; upgrade to pyyaml>=5.
Original answer
This seems to be a bug in PyYAML. A workaround is to use YAML escape sequences inside double-quoted strings:
$ cat test.yaml
- "\U0001f642"
- "\U0001f601"
- "\U0001f62c"
$ python
...
>>> yaml.load(open('test.yaml'))
['🙂', '😁', '😬']
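If the text already contains the emoji, the escaped form used in test.yaml can be produced with a small helper. This is only a sketch: `yaml_escape_supplementary` is a hypothetical name, not part of PyYAML, and it assumes the result is placed inside a double-quoted YAML scalar and that the text contains no quotes or backslashes that would need escaping themselves.

```python
def yaml_escape_supplementary(text):
    """Replace every code point above U+FFFF with a \\UXXXXXXXX escape.

    Hypothetical helper for illustration; the output is only meaningful
    inside a double-quoted YAML scalar.
    """
    return ''.join(
        '\\U{:08x}'.format(ord(ch)) if ord(ch) > 0xFFFF else ch
        for ch in text
    )


# Build one list item in the same shape as the lines in test.yaml.
escaped = yaml_escape_supplementary('\U0001f642')
line = '- "{}"'.format(escaped)
print(line)
```

Characters in the Basic Multilingual Plane pass through unchanged, so ordinary text is not affected.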
You should upgrade to ruamel.yaml (disclaimer: I am the author of that package), which has this and many other long-standing PyYAML issues fixed:
import sys
from ruamel.yaml import YAML

yaml = YAML()

# Dump the code point of every character in the file, four per row.
with open('emojis.yml') as fp:
    idx = 0
    for c in fp.read():
        print('{:08x}'.format(ord(c)), end=' ')
        idx += 1
        if idx % 4 == 0:
            print()

# Load the same file and write it back out.
with open('emojis.yml') as fp:
    data = yaml.load(fp)
yaml.dump(data, sys.stdout)
gives:
0000002d 00000020 0001f642 0000000a
0000002d 00000020 0001f601 0000000a
0000002d 00000020 0001f62c 0000000a
['🙂', '😁', '😬']
If you really have to stick with PyYAML, you can do:
import yaml.reader
import re
yaml.reader.Reader.NON_PRINTABLE = re.compile(
    u'[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')
to get rid of the error.
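To see why that pattern change matters, the two character classes can be compared using only the standard `re` module. This is a sketch under an assumption: `OLD_NON_PRINTABLE` is taken to be the unpatched pattern with the printable range ending at U+FFFD, which is not quoted from the PyYAML source here. `first_rejected` is a hypothetical helper.

```python
import re

# Assumed unpatched pattern: the printable range ends at U+FFFD,
# so every supplementary-plane character counts as non-printable.
OLD_NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD]')

# Patched pattern from the answer: U+10000..U+10FFFF is also allowed.
NEW_NON_PRINTABLE = re.compile(
    '[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')


def first_rejected(pattern, text):
    """Return the hex code point of the first rejected character, or None."""
    match = pattern.search(text)
    return None if match is None else '{:08x}'.format(ord(match.group()))


text = '- \U0001f642\n'  # one line of emojis.yml
old_hit = first_rejected(OLD_NON_PRINTABLE, text)
new_hit = first_rejected(NEW_NON_PRINTABLE, text)
print(old_hit, new_hit)
```

Under that assumption, the narrower pattern flags the emoji while the patched one accepts the whole line.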
Starting with version 0.15.16, ruamel.yaml now also dumps all supplementary plane Unicode without reverting to \Uxxxxxxxx (controllable in the new API via .unicode_supplementary, and depending on allow_unicode).