UnicodeDecodeError when reading CSV file in Pandas with Python
Simplest of all Solutions:
import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
Alternate Solution:
Sublime Text:
- Open the CSV file in Sublime Text or VS Code.
- Save the file in UTF-8 format.
- In Sublime Text, click File -> Save with Encoding -> UTF-8.
VS Code:
In the bottom bar of VS Code, you'll see a label showing the encoding the file was opened with (UTF-8 by default). Click it. A popup opens. Click Save with Encoding, then pick UTF-8 as the new encoding for that file.
Then, you could read your file as usual:
import pandas as pd
data = pd.read_csv('file_name.csv', encoding='utf-8')
If UTF-8 does not work, other encodings worth trying are:
encoding = "cp1252"
encoding = "ISO-8859-1"
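When it is not clear which of these encodings applies, one option is to try them in order and keep the first one that decodes the file without raising. The following is a minimal sketch using only the standard library; the helper name first_working_encoding and the candidate order are assumptions made for illustration. ISO-8859-1 goes last because it accepts every byte and therefore never fails.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file, or None."""
    with open(path, "rb") as fh:
        raw = fh.read()  # reads the whole file; fine for modestly sized CSVs
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes; try the next
        return name
    return None
```

The result can then be passed on, for example pd.read_csv(path, encoding=first_working_encoding(path)). Note that a successful decode only shows the bytes are valid in that encoding, not that it is the encoding the file was written in.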
Pandas lets you specify the encoding, but versions before 1.3 do not let you ignore errors or automatically replace the offending bytes (pandas 1.3 added an encoding_errors parameter to read_csv for this). So there is no one-size-fits-all method, but different ways depending on the actual use case.
You know the encoding, and there is no encoding error in the file. Great: you just have to specify the encoding:
file_encoding = 'cp1252'  # set file_encoding to the file encoding (utf8, latin1, etc.)
pd.read_csv(input_file_and_path, ..., encoding=file_encoding)
You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. OK, you only have to use the Latin1 encoding, because it accepts any possible byte as input (and converts it to the Unicode character with the same code):
pd.read_csv(input_file_and_path, ..., encoding='latin1')
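To make the "accepts any byte" claim concrete, here is a small standard-library sketch (the variable names are illustrative only). It decodes all 256 byte values as Latin1 and also shows the price paid when the file was really UTF-8: it loads, but accented characters come out garbled.

```python
# Every one of the 256 possible byte values decodes under Latin1,
# each to the Unicode character with the same code point.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin1")
same_code = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# UTF-8 text read as Latin1 raises no error, but 'é' (bytes C3 A9)
# becomes the two characters 'Ã©'.
garbled = "é".encode("utf-8").decode("latin1")

print(same_code, garbled)
```

So Latin1 is a way to guarantee the load succeeds, not a way to guarantee the text is right.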
You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real-world example is a UTF-8 file that has been edited with a non-UTF-8 editor and which contains some lines with a different encoding. Pandas before 1.3 has no provision for special error processing, but Python's open function has (assuming Python 3), and read_csv accepts a file-like object. Typical errors parameters to use here are 'ignore', which just suppresses the offending bytes, or (IMHO better) 'backslashreplace', which replaces the offending bytes with their Python backslashed escape sequence:
file_encoding = 'utf8'  # set file_encoding to the file encoding (utf8, latin1, etc.)
input_fd = open(input_file_and_path, encoding=file_encoding, errors='backslashreplace')
pd.read_csv(input_fd, ...)
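The difference between the two error handlers is easiest to see on a small sample. The sketch below, which uses only the standard library and made-up sample data, writes a mostly UTF-8 file containing one stray Latin1 byte and reads it back with each handler.

```python
import os
import tempfile

# A mostly UTF-8 file with one stray Latin1 byte (0xE9) in the last row.
raw = "name,city\nJosé,Málaga\n".encode("utf-8") + b"Ren\xe9,Paris\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(raw)
    sample_path = tmp.name


def read_text(path, errors):
    """Read the file as UTF-8 using the given error handler."""
    with open(path, encoding="utf8", errors=errors) as fh:
        return fh.read()


ignored = read_text(sample_path, "ignore")            # last row: Ren,Paris
escaped = read_text(sample_path, "backslashreplace")  # last row: Ren\xe9,Paris
os.remove(sample_path)

print(ignored.splitlines()[-1])
print(escaped.splitlines()[-1])
```

With 'ignore' the bad byte disappears silently, while 'backslashreplace' leaves a visible marker that can be searched for and repaired later. On pandas 1.3 or later, read_csv also takes an encoding_errors argument that accepts the same values, which may make the explicit open call unnecessary.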
read_csv takes an encoding option to deal with files in different formats. I mostly use read_csv('file', encoding = "ISO-8859-1"), or alternatively encoding = "utf-8" for reading, and generally utf-8 for to_csv.
You can also use one of several alias options like 'latin' instead of 'ISO-8859-1', or try 'cp1252' (Windows), which is a closely related but distinct encoding rather than an alias (see Python docs, also for numerous other encodings you may encounter).
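Whether two encoding names refer to the same codec can be checked with the standard library's codecs.lookup. The short sketch below (variable names are illustrative) compares 'latin' and 'cp1252' against 'ISO-8859-1' and decodes one byte on which the two encodings disagree.

```python
import codecs

# 'latin' resolves to the same codec as 'ISO-8859-1' ...
latin_is_alias = codecs.lookup("latin").name == codecs.lookup("ISO-8859-1").name
# ... whereas cp1252 resolves to a separate codec.
cp1252_is_alias = codecs.lookup("cp1252").name == codecs.lookup("ISO-8859-1").name

# The two differ in the 0x80-0x9F range: byte 0x80 is the euro sign in
# cp1252 but a non-printing control character in ISO-8859-1.
byte_as_cp1252 = b"\x80".decode("cp1252")
byte_as_latin1 = b"\x80".decode("ISO-8859-1")

print(latin_is_alias, cp1252_is_alias, byte_as_cp1252, repr(byte_as_latin1))
```

In practice this means files produced on Windows that contain curly quotes or the euro sign tend to read better with cp1252 than with ISO-8859-1.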
See the relevant Pandas documentation, the Python docs examples on CSV files, and plenty of related questions here on SO. A good background resource is What every developer should know about unicode and character sets.
To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see man page) or file -i (Linux) or file -I (macOS) (see man page).
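If the detection step needs to run from Python, the file command can be called through subprocess. This is a rough sketch under a few assumptions: the helper name guess_encoding_with_file is made up, the file utility must be installed, and the long option --mime-encoding is used instead of -i / -I so that only the encoding is printed.

```python
import shutil
import subprocess


def guess_encoding_with_file(path):
    """Ask the `file` command for its encoding guess; None if unavailable."""
    if shutil.which("file") is None:
        return None  # `file` is not installed (typical on plain Windows)
    result = subprocess.run(
        ["file", "-b", "--mime-encoding", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

The value returned is whatever file reports (for example utf-8, iso-8859-1, or unknown-8bit). It is a guess based on the bytes, and not every name it prints is a valid Python codec name, so treat it as a hint for the encoding argument rather than a definitive answer.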