Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input gzip file
I didn't really find a Python solution, but I managed to fix the file with Unix tools:
First I decompressed it: zless myfile.txt.gz > uncompressedMyfile.txt
Then I used sed to remove the last line, because I could clearly see that the last line was corrupt:
sed -i '$d' uncompressedMyfile.txt
(The -i flag, as in GNU sed, edits the file in place; without it sed only prints the result and leaves the file unchanged.)
I gzipped the file again: gzip -k uncompressedMyfile.txt
This produces uncompressedMyfile.txt.gz, which I was able to read successfully with the following Python code:
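import os
import pandas as pd

def read_gzipped_csv(filePath, fileName):
    # on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3);
    # pd.errors.ParserError is the current name of CParserError.
    try:
        return pd.read_csv(os.path.join(filePath, fileName), sep='|',
                           compression='gzip', dtype=str, on_bad_lines='skip')
    except pd.errors.ParserError:
        print("Something is wrong with the file")
        return None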
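The Unix steps in this answer can probably be approximated in Python as well. The following is only a rough sketch, not something from the original answer: salvage_gzip is a made-up helper name, and it assumes a single-member gzip file that fits in memory. It decodes as much of the stream as it can with zlib, drops the last line if the stream did not end cleanly, and writes the rest to a new gzip file.

```python
import gzip
import zlib

def salvage_gzip(src_path, dst_path, chunk_size=64 * 1024):
    """Write the readable lines of a damaged gzip file to dst_path.

    Hypothetical helper; returns the number of lines kept.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    data = bytearray()
    with open(src_path, "rb") as src:
        try:
            while not decoder.eof:
                chunk = src.read(chunk_size)
                if not chunk:
                    break  # input ended before the gzip end-of-stream marker
                data += decoder.decompress(chunk)
        except zlib.error:
            pass  # corrupt bytes: keep whatever was decoded before them
    lines = bytes(data).splitlines(keepends=True)
    if not decoder.eof and lines:
        lines.pop()  # like sed '$d': the last line is probably incomplete
    with gzip.open(dst_path, "wb") as dst:
        dst.writelines(lines)
    return len(lines)
```

Something like salvage_gzip("myfile.txt.gz", "myfile.fixed.txt.gz") (both file names are placeholders) would then give a file that can be passed to pd.read_csv with compression='gzip'. Unlike the sed step, this sketch keeps the last line when the gzip stream ends cleanly.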
Chances are the path you passed is actually that of a folder instead of the file that needs to be read. pandas.read_csv can't read folders; it needs the explicit name of a compatible file.
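One way to rule this out is to check the path before handing it to pandas. This is a minimal sketch using only os.path; resolve_csv_path is a hypothetical helper name, not part of pandas.

```python
import os

def resolve_csv_path(file_path, file_name):
    """Join folder and file name, and fail early if the result is not a file."""
    full_path = os.path.join(file_path, file_name)
    if os.path.isdir(full_path):
        # This is the situation described above: a folder, not a file.
        raise IsADirectoryError("Path is a folder, not a file: %s" % full_path)
    if not os.path.isfile(full_path):
        raise FileNotFoundError("No such file: %s" % full_path)
    return full_path
```

The returned path can then be passed to pd.read_csv, so a wrong path shows up as a clear error instead of a tokenizing error.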
The input zip file is corrupted. Get a proper copy of the file from the source, or try a zip repair tool, before you pass it along to pandas.
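To confirm that the archive itself is the problem, it may help to test it with the standard library before involving pandas. The sketch below is an assumption about how such a check could look; archive_is_intact is a made-up name. It treats the file as a zip archive if zipfile recognises it and otherwise tries to read it to the end as gzip.

```python
import gzip
import zipfile
import zlib

def archive_is_intact(path, chunk_size=1 << 20):
    """Return True if a zip or gzip file can be read to the end without errors."""
    try:
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                # testzip() returns the first bad member name, or None if all are fine.
                return archive.testzip() is None
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass  # decompress and discard the data
        return True
    except (EOFError, OSError, zlib.error, zipfile.BadZipFile):
        # Truncated stream, bad header, failed checksum or unreadable file.
        return False
```

If this returns False for the file that pandas fails on, the file is most likely damaged and a fresh copy is the safer fix.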