How do I check if a string is unicode or ascii?
How to tell if an object is a unicode string or a byte string
You can use type or isinstance.
In Python 2:
>>> type(u'abc') # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc') # Python 2 byte string literal
<type 'str'>
In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text.
If you want to understand this more, I recommend http://farmdev.com/talks/unicode/.
In Python 3:
>>> type('abc') # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc') # Python 3 byte string literal
<class 'bytes'>
In Python 3, str is like Python 2's unicode, and is used to store text. What was called str in Python 2 is called bytes in Python 3.
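As a rough sketch of the isinstance approach in Python 3, the check could look like the following. The function name describe_type and its return labels are made up for illustration; only isinstance, str, bytes, and bytearray come from Python itself.

```python
def describe_type(obj):
    """Label an object by its Python 3 string type (hypothetical helper)."""
    if isinstance(obj, str):
        # Python 3 str holds Unicode text.
        return "text (str)"
    if isinstance(obj, (bytes, bytearray)):
        # bytes and bytearray hold raw bytes with no known encoding.
        return "bytes"
    return "not a string"


print(describe_type('abc'))   # text (str)
print(describe_type(b'abc'))  # bytes
print(describe_type(42))      # not a string
```

isinstance is usually preferable to comparing type(obj) directly, because it also accepts subclasses of str and bytes.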
How to tell if a byte string is valid utf-8 or ascii
You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.
>>> u_umlaut = b'\xc3\x9c' # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
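The decode test can be wrapped in a small function. This is a minimal sketch for Python 3; the name is_valid_encoding is an assumption made for the example, not part of any library.

```python
def is_valid_encoding(data, encoding):
    """Return True if the byte string decodes cleanly with the given encoding.

    Hypothetical helper: it only reports whether decoding succeeds, not
    whether the encoding is the one the data was really written in.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b'\xc3\x9c'  # UTF-8 representation of the letter 'Ü'
print(is_valid_encoding(u_umlaut, 'utf-8'))  # True
print(is_valid_encoding(u_umlaut, 'ascii'))  # False
print(is_valid_encoding(b'abc', 'ascii'))    # True
```

Note that a successful decode only shows the bytes are valid in that encoding. Pure ASCII bytes, for instance, are also valid UTF-8.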
Unicode is not an encoding - to quote Kumar McMillan:
If ASCII, UTF-8, and other byte strings are "text" ...
...then Unicode is "text-ness";
it is the abstract form of text
Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.
In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check for str (which means Unicode string by default) should suffice.
isinstance(x, str)
In Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.
If you want to check whether you have a 'string-like' object in a single statement, though, you can do the following:
isinstance(x, basestring)
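If the same code has to run on both versions, one possible sketch is to fall back to str when basestring is missing, since that name was removed in Python 3. The names string_types and is_string_like are assumptions made for this example.

```python
try:
    # Python 2: basestring is the common base class of str and unicode.
    string_types = (basestring,)
except NameError:
    # Python 3: basestring no longer exists; str is the only text type.
    string_types = (str,)


def is_string_like(x):
    """Return True if x is a string type on the running Python version."""
    return isinstance(x, string_types)


print(is_string_like('abc'))
print(is_string_like(5))
```

The two versions disagree about byte strings: on Python 2 a b'abc' literal is a str and passes the check, while on Python 3 it is a bytes object and does not.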
In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.
In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:
def whatisthis(s):
    # Python 2 only: uses the print statement and the unicode type.
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"
This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
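To look at the content rather than the type, you can test whether every character or byte falls in the ASCII range. The following is a minimal Python 3 sketch; the helper names is_ascii_text and is_ascii_bytes are made up for the example.

```python
def is_ascii_text(s):
    """True if every character of the str s has a code point below 128."""
    return all(ord(ch) < 128 for ch in s)


def is_ascii_bytes(b):
    """True if every byte of the bytes object b is below 128."""
    # Iterating over bytes in Python 3 yields integers.
    return all(byte < 128 for byte in b)


print(is_ascii_text('abc'))            # True
print(is_ascii_text('\xdc'))           # False ('Ü' is outside ASCII)
print(is_ascii_bytes(b'abc'))          # True
print(is_ascii_bytes(b'\xc3\x9c'))     # False
```

On Python 3.7 and later, the built-in str.isascii() and bytes.isascii() methods perform the same check.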