How do I convert a Python 3 byte-string variable into a regular string?
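Call decode() on a bytes instance to get the text it encodes. In Python 3 the encoding defaults to UTF-8. Here data stands for your bytes object (avoid naming variables str or bytes, which shadows the built-in types):
text = data.decode()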
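As a rough illustration of the call above, here is a minimal round trip. The variable names and the sample value are made up for the example; only bytes.decode() and str.encode() come from the standard library.

```python
# Hypothetical sample value; any bytes object behaves the same way.
data = "héllo wörld".encode("utf-8")

# Passing the encoding explicitly; "utf-8" is also the default in Python 3.
text = data.decode("utf-8")

print(type(data).__name__)  # bytes
print(type(text).__name__)  # str
print(text)                 # héllo wörld
```

Naming the encoding explicitly is optional here, but it makes the intent obvious when the data might come from a non-UTF-8 source.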
How to filter (skip) non-UTF8 characters from an array?
To address this comment in @uname01's post and the OP, ignore the errors:
Code
>>> b'\x80abc'.decode("utf-8", errors="ignore")
'abc'
Details
From the docs, here are more examples using the same errors parameter:
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), or 'ignore' (just leave the character out of the Unicode result).
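Since the question asks about an array, one possible way to apply the same errors parameter to every element is sketched below. The helper name decode_lenient and the sample list are assumptions made for illustration; they are not part of any library.

```python
# Hypothetical input: a list of byte strings, some with invalid UTF-8 bytes.
raw_items = [b"ok", b"\x80abc", b"caf\xc3\xa9", b"bad\xff"]


def decode_lenient(items, errors="ignore"):
    """Decode each bytes item as UTF-8 using the given error handler."""
    return [item.decode("utf-8", errors=errors) for item in items]


# "ignore" silently drops the bytes that are not valid UTF-8.
skipped = decode_lenient(raw_items)
print(skipped)  # ['ok', 'abc', 'café', 'bad']

# "replace" keeps a visible U+FFFD marker where bytes were invalid.
marked = decode_lenient(raw_items, errors="replace")
print(marked)  # ['ok', '\ufffdabc', 'café', 'bad\ufffd']
```

Which handler fits depends on the use case: 'ignore' loses data without a trace, while 'replace' leaves evidence that something was dropped.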
You had it nearly right in the last line. You want str(bytes_string, 'utf-8') because the type of bytes_string is bytes, the same as the type of b'abc'.
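A short sketch of why the encoding argument matters, using a made-up sample value for bytes_string: with an encoding, str() decodes the bytes just like decode() does; without one, it returns the printable representation of the bytes object instead.

```python
bytes_string = b"abc"  # hypothetical sample value

# With an encoding, str() decodes the bytes, same as bytes.decode().
via_constructor = str(bytes_string, "utf-8")
via_method = bytes_string.decode("utf-8")
print(via_constructor == via_method)  # True

# Without an encoding, str() gives the repr of the bytes object,
# including the b prefix and the quotes.
as_repr = str(bytes_string)
print(as_repr)  # b'abc'
```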