UnicodeError: UTF-16 stream does not start with BOM
The problem is that your input file apparently doesn’t start with a BOM (a special character that is encoded recognizably differently in little-endian vs. big-endian UTF-16), so you can’t just use “utf-16” as the encoding; you have to explicitly use “utf-16-le” or “utf-16-be”.
If you don’t do that, codecs will guess, and if it guesses wrong, it’ll read the two bytes of each code unit in the wrong order and get illegal values.
If your posted sample starts at an even offset and contains a bunch of ASCII, it’s little-endian, so use the -le version. (But of course it’s better to look at what it actually is than to guess.)
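One way to look rather than guess is to inspect the first bytes of the file. The following is only a minimal sketch: the helper name `guess_utf16_codec` and the zero-byte heuristic are assumptions made for illustration, and the heuristic is only meaningful for text that is mostly ASCII.

```python
import codecs


def guess_utf16_codec(head: bytes) -> str:
    """Guess a UTF-16 codec name from the first bytes of a file (hypothetical helper)."""
    # A BOM settles the question: plain "utf-16" reads it and picks the byte order.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: for mostly-ASCII text, the zero byte of each 2-byte unit
    # comes second in little-endian and first in big-endian.
    zeros_at_even = head[0::2].count(0)
    zeros_at_odd = head[1::2].count(0)
    if zeros_at_odd > zeros_at_even:
        return "utf-16-le"
    if zeros_at_even > zeros_at_odd:
        return "utf-16-be"
    raise ValueError("cannot tell the byte order from this sample")
```

With the file opened in binary mode ('rb'), passing the first few dozen bytes to this helper gives a name that can then be used as the encoding argument when reopening the file in text mode.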
After hours of struggling with this issue, I learned that Excel exports data in multiple CSV formats.
From Excel, make sure to use the 'CSV UTF-8 (Comma delimited)' option when exporting. (You will often want this type rather than the other CSV options.)
Once you are sure of the UTF type (in this case, 'UTF-8'), go back to your Python script and change the encoding to 'UTF-8'. I found that omitting this parameter also works, though that relies on the platform's default encoding happening to be compatible.
import csv

with open('schools_dataset.csv', encoding='utf-8', newline='') as csv_file:
    reader = csv.reader(csv_file)  # continue processing the file from here
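One caveat worth checking: Excel's 'CSV UTF-8' export typically begins with a UTF-8 BOM, which plain 'utf-8' leaves attached to the first field. The sketch below uses made-up in-memory data (not the real schools_dataset.csv) to show the difference between 'utf-8' and Python's 'utf-8-sig' codec; whether your own export has a BOM is something to verify, not assume.

```python
import csv
import io

# Simulated bytes of a CSV export that begins with a UTF-8 BOM (made-up data).
raw = b"\xef\xbb\xbfname,city\r\nAlpha School,Springfield\r\n"

# Plain "utf-8" keeps the BOM as U+FEFF at the start of the first field.
rows_utf8 = list(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))

# "utf-8-sig" drops a leading BOM if there is one, and is harmless if there is not.
rows_sig = list(csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline="")))
```

If the first column name comes back with a stray invisible character, passing encoding='utf-8-sig' to open() is the usual remedy.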
Now that you’ve included more of the file in your question, that isn’t a CSV file at all. My guess is that it’s an old-style binary XLS file, but that’s just a guess. If you’re just renaming spam.xls to spam.csv, you can’t do that; you need to export it to CSV format. (If you need help with that, ask on another site that offers help with Excel instead of with programming.)
If you can’t do that for some reason, there are libraries on PyPI to parse XLS files, but if you wanted CSV, and you can export CSV, that’s a better idea.
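To test the guess that the file is a renamed spreadsheet rather than CSV, the first few bytes can be compared against well-known file signatures. This is a rough sketch: the function name `sniff_spreadsheet_kind` is hypothetical, and a match only shows the container type (the OLE2 container is shared by legacy .xls and other old Office formats; zip is shared by .xlsx and many others), not that the file is definitely an Excel workbook.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls, .doc, ... container
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx and other zip-based files


def sniff_spreadsheet_kind(head: bytes) -> str:
    """Classify a file by its first bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Reading the first 8 bytes of the file in binary mode and passing them in is enough; "unknown" means neither signature matched, which is what a genuine text CSV would give.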