Reading Unicode file data with BOM chars in Python
BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig
encoding. You could try something like this:
import io
import chardet
import codecs
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
if raw.startswith(codecs.BOM_UTF8):
encoding = 'utf-8-sig'
else:
result = chardet.detect(raw)
encoding = result['encoding']
infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
There is no reason to check if a BOM exists or not, utf-8-sig
manages that for you and behaves exactly as utf-8
if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see utf-8-sig
correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig
and not worry about it