Which characters are considered whitespace by split()?
Unfortunately, it depends on whether your string is a str or a unicode (at least in CPython; I don't know whether this behavior is actually mandated by a specification anywhere).
If it is a str, the answer is straightforward:
0x09: Tab
0x0a: Newline
0x0b: Vertical Tab
0x0c: Form Feed
0x0d: Carriage Return
0x20: Space
Source: these are the characters with PY_CTF_SPACE in Python/pyctype.c, which are used by Py_ISSPACE, which is used by STRINGLIB_ISSPACE, which is used by split_whitespace.
If it is a unicode, there are 29 characters, which in addition to the above are:
U+001c through U+001f: File/Group/Record/Unit Separator
U+0085: Next Line
U+00a0: Non-Breaking Space
U+1680: Ogham Space Mark
U+2000 through U+200a: various fixed-size spaces (e.g. Em Space), but note that Zero-Width Space (U+200b) is not included
U+2028: Line Separator
U+2029: Paragraph Separator
U+202f: Narrow No-Break Space
U+205f: Medium Mathematical Space
U+3000: Ideographic Space
Note that the first four (U+001c through U+001f) are also valid ASCII characters, which means that an ASCII-only string might split differently depending on whether it is a str or a unicode!
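To see that difference on a current interpreter, here is a minimal sketch. It assumes Python 3, where bytes plays the role of the Python 2 str and str plays the role of unicode; the variable names are only illustrative.

```python
# Python 3 sketch (assumption): bytes stands in for the Python 2 str,
# and str stands in for the Python 2 unicode.
ascii_only = "a\x1cb"  # U+001c (File Separator) is a valid ASCII character

as_text = ascii_only.split()                   # uses the Unicode whitespace set
as_bytes = ascii_only.encode("ascii").split()  # uses the ASCII-only whitespace set

print(as_text)   # ['a', 'b']
print(as_bytes)  # [b'a\x1cb']
```

The same ASCII-only data splits into two pieces as text but stays in one piece as bytes.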
Source: these are the characters listed in _PyUnicode_IsWhitespace, which is used by Py_UNICODE_ISSPACE, which is used by STRINGLIB_ISSPACE (it looks like they use the same function implementations for both str and unicode, but compile them separately for each type, with certain macros implemented differently). The docstring describes this set of characters as follows:
Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs'
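The rule in that docstring can be checked from Python itself with the standard unicodedata module. This is only a sketch: matches_docstring is a hypothetical helper name, and it assumes the unicodedata tables match the Unicode version the interpreter uses for split().

```python
import unicodedata

def matches_docstring(ch):
    """Return True if ch has bidirectional type WS, B or S, or category Zs."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")

# Print the rule's verdict next to what split() actually does for a few samples.
for ch in ("\t", "\x1c", "\u00a0", "\u2028", "\u200b"):
    splits = len(f"a{ch}b".split()) > 1
    print(f"U+{ord(ch):04x}", matches_docstring(ch), splits)
```

The two columns should agree for every sample, including Zero-Width Space (U+200b), which is category Cf and so falls outside the rule.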
The answer by Aasmund Eldhuset is what I was attempting to do but I was beaten to the punch. It shows a lot of research and should definitely be the accepted answer.
If you want confirmation of that answer (or just want to test it in a different implementation, such as a non-CPython one, or a later one which may use a different Unicode standard under the covers), the following short program will print out the actual characters that cause a split when using .split() with no arguments.
It does this by constructing a string with the a and b characters(a) separated by the character being tested, then detecting whether split creates a list with more than one element:
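int_ch = 0
while True:
    try:
        # chr() raises ValueError once int_ch passes the last valid code point.
        test_str = "a" + chr(int_ch) + "b"
    except Exception as e:
        print(f'Stopping, {e}')
        break
    # More than one piece means the tested character was treated as whitespace.
    if len(test_str.split()) != 1:
        print(f'0x{int_ch:06x} ({int_ch})')
    int_ch += 1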
The output (for my system) is as follows:
0x000009 (9)
0x00000a (10)
0x00000b (11)
0x00000c (12)
0x00000d (13)
0x00001c (28)
0x00001d (29)
0x00001e (30)
0x00001f (31)
0x000020 (32)
0x000085 (133)
0x0000a0 (160)
0x001680 (5760)
0x002000 (8192)
0x002001 (8193)
0x002002 (8194)
0x002003 (8195)
0x002004 (8196)
0x002005 (8197)
0x002006 (8198)
0x002007 (8199)
0x002008 (8200)
0x002009 (8201)
0x00200a (8202)
0x002028 (8232)
0x002029 (8233)
0x00202f (8239)
0x00205f (8287)
0x003000 (12288)
Stopping, chr() arg not in range(0x110000)
You can ignore the error at the end. It is only there to confirm that the loop doesn't fail until it has moved out of the valid Unicode range (code points 0x000000 through 0x10ffff, making up the seventeen planes).
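If you would rather collect the results than print them, the same test can be written without relying on the exception from chr(). This is a sketch, not a replacement for the program above: splitting_code_points is a hypothetical helper name, and it assumes Python 3, where sys.maxunicode is 0x10ffff.

```python
import sys

def splitting_code_points():
    """Return every code point that makes split() break 'a?b' apart."""
    return [cp for cp in range(sys.maxunicode + 1)
            if len(f"a{chr(cp)}b".split()) != 1]

found = splitting_code_points()
print(len(found))
print([f"0x{cp:06x}" for cp in found])
```

The length of the list gives a quick way to compare interpreters or Unicode versions.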
(a) I'm hoping that no future version of Python ever considers a or b to be whitespace, as that would totally break this (and a lot of other) code. I think the chances of that are rather slim, so it should be fine :-)