Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input nzip file

I didn't really find a Python solution, but I managed to fix it with Unix tools:

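First I ran zless myfile.txt.gz > uncompressedMyfile.txt to decompress the file, then I used sed to remove the last line, because I could clearly see that the last line was corrupt. Note the -i flag (GNU sed): it edits the file in place, whereas plain sed '$d' only prints the result to stdout and leaves the file unchanged.

sed -i '$d' uncompressedMyfile.txt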

I gzipped the file again with gzip -k uncompressedMyfile.txt, which writes uncompressedMyfile.txt.gz and keeps the uncompressed copy.
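For anyone who does want to stay in Python, the three steps above (decompress, drop the last line, recompress) could be sketched roughly as follows. This is only a sketch under assumptions: the function names salvage_gzip, drop_last_line and repair_gzip are made up for illustration, the whole file is held in memory, and only the first gzip member is read. It uses zlib directly so that whatever decompresses before the damaged point is kept.

```python
import gzip
import zlib


def salvage_gzip(src_path, chunk_size=64 * 1024):
    """Return every byte that decompresses cleanly before the stream breaks."""
    decomp = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    recovered = bytearray()
    with open(src_path, "rb") as fh:
        while not decomp.eof:
            chunk = fh.read(chunk_size)
            if not chunk:
                break  # file ended early (truncated archive)
            try:
                recovered += decomp.decompress(chunk)
            except zlib.error:
                break  # corrupt block: keep what was recovered so far
    return bytes(recovered)


def drop_last_line(data):
    """Remove the final line, like sed '$d'."""
    lines = data.splitlines(keepends=True)
    return b"".join(lines[:-1])


def repair_gzip(src_path, dst_path):
    """Salvage src_path, drop its last line, and write a fresh gzip file."""
    cleaned = drop_last_line(salvage_gzip(src_path))
    with gzip.open(dst_path, "wb") as out:
        out.write(cleaned)
```

Calling repair_gzip('myfile.txt.gz', 'repaired.txt.gz') would then produce a file that can be passed to pandas with compression='gzip'. The last line is dropped unconditionally, on the assumption (as in my case) that it is the damaged one.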

I was then able to read the file successfully with the following Python code:

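import os
import pandas as pd
from pandas.errors import ParserError  # called CParserError in old pandas releases

def read_gzipped_csv(filePath, fileName):
    try:
        # on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
        return pd.read_csv(os.path.join(filePath, fileName), sep='|',
                           compression='gzip', dtype=str, on_bad_lines='skip')
    except ParserError:
        print("Something is wrong with the file")
        return None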

Chances are the path you passed is actually that of a folder rather than the file that needs to be read.

pandas.read_csv can't read folders; it needs the explicit path of a compatible file.
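A quick way to rule this out is to check the path before handing it to pandas. A minimal sketch, assuming a helper named describe_path (a made-up name, not part of pandas):

```python
import os


def describe_path(path):
    """Report whether path is a directory, a regular file, or missing."""
    if os.path.isdir(path):
        return "directory"  # read_csv cannot open this
    if os.path.isfile(path):
        return "file"  # safe to try read_csv on it
    return "missing"
```

If describe_path(os.path.join(filePath, fileName)) comes back as "directory", the file name was probably left off the path.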


The input zip file is corrupted. Get a proper copy of this file from the source, or try a zip repair tool before you pass it along to pandas.
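To confirm the corruption before blaming pandas, you can read the archive to the end yourself. A rough sketch, assuming the file is gzip-compressed (as the compression='gzip' argument in the question suggests) and using a made-up helper name, gzip_is_intact:

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip file decompresses without an error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data; only the errors matter here.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip header or a missing file, EOFError a
        # truncated stream, zlib.error damaged compressed data.
        return False
    return True
```

If this returns False for your file, the error comes from the archive itself and a fresh copy is the safest fix.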

Tags:

Python

Pandas