Reading data from CSV into dataframe with multiple delimiters efficiently

If this is an option, substituting the character ; with , in the string is faster. I have written the string x to a file test.dat.

def csv_reader_4(x):
    with open(x, 'r') as f:
        a = f.read()
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])

The unicode() function was necessary to avoid a TypeError in Python 2.

Benchmarking:

%timeit csv_reader_2('test.dat')  # 1.6 s per loop
%timeit csv_reader_4('test.dat')  # 1.2 s per loop

How about using a generator to do the replacement, and combining it with an appropriate decorator to get a file-like object suitable for pandas?

import io
import pandas as pd

# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def replacementgenerator(haystack, needle, replace):
    for s in haystack:
        if s == needle:
            yield str.encode(replace);
        else:
            yield str.encode(s);

csv = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")), usecols=[3, 4, 5])

Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.

This approach is functionally identical to the answer by Daniele except for the fact that we replace values "on-the-fly", as they are requested instead of all in one go.

Use a command-line tool

By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.

Essentially, using CPython or a tool written in C / C++ is likely to outperform Python-level manipulations.

For example, using Find And Replace Text:

import os

os.chdir(r'C:\temp')                       # change directory location
os.system('fart.exe -c file.csv ";" ","')  # run FART with character to replace

df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None)  # read file into Pandas

A very very very fast one, 3.51 is the result, simply just make csv_reader_4 the below, it simply converts StringIO to str, then replaces ; with ,, and reads the dataframe with sep=',':

def csv_reader_4(x):
    with x as fin:
        reader = pd.read_csv(StringIO(fin.getvalue().replace(';',',')), sep=',',header=None)
    return reader

The benchmark:

%timeit csv_reader_4(StringIO(x)) # 3.51 s per loop

Reading data from CSV into dataframe with multiple delimiters efficiently

Use a command-line tool

Tags:

Python

Pandas

Performance

Csv

Dataframe

Related

Recent Posts